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Abstract 

We propose a novel approach for nonlinear regression using a two-layer neural network (NN) model 
structure with sparsity-favoring hierarchical priors on the network weights. We present an expec- 
tation propagation (EP) approach for approximate integration over the posterior distribution of the 
weights, the hierarchical scale parameters of the priors, and the residual scale. Using a factorized 
posterior approximation we derive a computationally efficient algorithm, whose complexity scales 
similarly to an ensemble of independent sparse Unear models. The approach enables flexible defini- 
tion of weight priors with different sparseness properties such as independent Laplace priors with a 
common scale parameter or Gaussian automatic relevance determination (ARD) priors with differ- 
ent relevance parameters for all inputs. The approach can be extended beyond standard activation 
functions and NN model structures to form flexible nonlinear predictors from multiple sparse lin- 
ear models. The effects of the hierarchical priors and the predictive performance of the algorithm 
are assessed using both simulated and real-world data. Comparisons are made to two alternative 
models with ARD priors: a Gaussian process with a NN covariance function and marginal maxi- 
mum a posteriori estimates of the relevance parameters, and a NN with Markov chain Monte Carlo 
integration over all the unknown model parameters. 

Keywords: expectation propagation, neural network, multilayer perceptron, linear model, sparse 
prior, automatic relevance determination 



1. Introduction 



Gaussian priors may not be the best possible choice for the input layer weights of a feedforward 
neural network (NN) because allowing, a priori, a large weight wj for a potentially irrelevant fea- 
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ture Xj may deteriorate the predictive performance. This behavior is analogous to a linear model 
because the input layer weights associated with each hidden unit of a multilayer perceptron (MLP) 
network can be interpreted as separate linear models whose outputs are combined nonlinearly in 
the next layer. Integrating over the posterior uncertainty of the unknown input weights mitigates 
the potentially harmful effects of irrelevant features but it may not be sufficient if the number of 
input features, or the total number of unknown variables, grows large compared with the number of 
observations. In such cases, an alternative strategy is required to suppress the effect of the irrelevant 
features. In this article we focus on a two-layer MLP model structure but aim to form a more general 
framework that can be used to combine linear models with sparsity-promoting priors using general 
activation functions and interaction terms between the hidden units. 

A popular approach has been to apply hierarchical automatic relevance determination (ARD) 
priors (Mackay, 1995; Neal, 1996), where individual Gaussian priors are assigned for each weight, 
Wj ~ 9\C{0,aij), with separate variance hyperparameters tt;^ controlling the relevance of the group 
of weights related to each feature. Mackay (1995) described an ARD approach for NNs, where 
point estimates for the relevance parameters aij along with other model hyperparameters, such as 
the noise level, are determined using a marginal likelihood estimate obtained by approximate inte- 
gration over the weights with Laplace's method. Neal (1996) proposed an alternative Markov chain 
Monte Carlo (MCMC) approach, where approximate integration is performed over the posterior un- 
certainty of all the model parameters including wj and aij. In connection with linear models, various 
computationally more efficient algorithms have been proposed for determining marginal likelihood 
based point estimates for the relevance parameters (Tipping, 2001; Qi et al., 2004; Wipf and Na- 
garajan, 2008). The point-estimate based methods, however, may suffer from overfitting because 
the maximum a posteriori (MAP) estimate of ttij may be close to zero also for relevant features as 
demonstrated by Qi et al. (2004). The same applies also for infinite neural networks implemented 
using Gaussian process (GP) priors when separate hyperparameters controlling the nonlinearity of 
each input are optimized (Williams, 1998; Rasmussen and Williams, 2006). 

Recently, appealing surrogates for ARD priors have been presented for linear models. These 
approaches are based on sparsity favoring priors, such as the Laplace prior (Seeger, 2008) and the 
spike and slab prior (Hemandez-Lobato et al., 2008, 2010). The methods utilize the expectation 
propagation (EP) (Minka, 2001a) algorithm to efficiently integrate over the analytically intractable 
posterior distributions. Importantly, these sparse priors do not suffer from similar overfitting as 
many ARD approaches because point estimates of feature specific parameters such as a;^. are not 
used, but instead, approximate integration is done over the posterior uncertainty resulting from a 
sparse prior on the weights. Expectation propagation provides a useful alternative to MCMC for 
carrying out the approximate integration because it has been found computationally efficient and 
very accurate in many practical applications (Nickisch and Rasmussen, 2008; Hemandez-Lobato 
et al., 2010). 

In nonlinear regression, sparsity favoring Laplace priors have been considered for NNs, for 
instance, by Williams (1995), where the inference is performed using the Laplace approximation. 
However, a problem with the Laplace approximation is that the curvature of the log -posterior density 
at the posterior mode may not be well defined for all types of prior distributions, such as, the 
Laplace distribution whose derivatives are not continuous at the origin (Williams, 1995; Seeger, 
2008). Implementing a successful algorithm requires some additional approximations as described 
by Williams (1995), whereas with EP the implementation is straightforward since it rehes only on 
expectations of the prior terms with respect to a Gaussian measure. 
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Another possibly undesired characteristic of the Laplace approximation is that it approximates 
the posterior mean of the unknowns with their MAP estimate and their posterior covariance with the 
negative Hessian of the posterior distribution at the mode. This local estimate may not represent well 
the overall uncertainty on the unknown variables and it may lead to worse predictive performance for 
example when the posterior distribution is skewed (Nickisch and Rasmussen, 2008) or multimodal 
(Jylanki et al., 2011). Furthermore, when there are many unknowns compared to the effective 
number of observations, which is typical in practical NN applications, the MAP solution may differ 
significantly from the posterior mean. For example, with linear models the Laplace prior leads to 
strictly sparse estimates with many zero weight values only when the MAP estimator of the weights 
is used. The posterior mean estimate, on the other hand, can result in many clearly nonzero values 
for the same weights whose MAP estimates are zero (Seeger, 2008). In such case the Laplace 
approximation underestimates the uncertainty of the feature relevances, that is, the joint mode is 
sharply peaked at zero but the bulk of the probability mass is distributed widely at nonzero weight 
values. Recently, it has also been shown that in connection with linear models the ARD solution 
is exactly equivalent to a MAP estimate of the coefficients obtained using a particular class of non- 
factorized coefficient prior distributions which includes models that have desirable advantages over 
MAP weight estimates (Wipf and Nagarajan, 2008; Wipf et al., 2011). This connection suggests 
that the Laplace approximation of the input weights with a sparse prior may possess more similar 
characteristics with the point-estimate based ARD solution compared to the posterior mean solution. 

Our aim is to introduce the benefits of the sparse linear models (Seeger, 2008; Hernandez- 
Lobato et al., 2008) into nonlinear regression by combining the sparse priors with a two-layer NN 
in a computationally efficient EP framework. We propose a logical extension of the linear regres- 
sion models by adopting the algorithms presented for sparse linear models to MLPs with a linear 
input layer For this purpose, the main challenge is constructing a reliable Gaussian EP approxima- 
tion for the analytically intractable likelihood resulting from the NN observation model. Previously, 
Gaussian approximations for NN models have been formed using the extended Kahnan filter (EKE) 
(de Freitas, 1999) and the unscented Kalman filter (UKF) (Wan and van der Merwe, 2000). Alterna- 
tive mean field approaches possessing similar characteristic with EP have been proposed by Opper 
and Winther (1996) and Winther (2001). 

We extend the ideas from the UKF approach by utilizing similar decoupling approximations for 
the weights as proposed by Puskorius and Feldkamp (1991) for EKF-based inference and derive a 
computationally efficient algorithm that does not require numerical sigma point approximations for 
multi-dimensional integrals. We propose a posterior approximation that assumes the weights asso- 
ciated with the output-layer and each hidden unit independent. The complexity of the EP updates 
in the resulting algorithm scale linearly with respect to the number of hidden units and they require 
only one-dimensional numerical quadratures. The complexity of the posterior computations scale 
similarly to an ensemble of independent sparse linear models (one for each hidden unit) with one 
additional linear predictor associated with the output layer It follows that all existing methodology 
on sparse linear models (e.g., methods that facilitate computations with large number of inputs) can 
be applied separately on the sparse linear model corresponding to each hidden unit. Furthermore, 
the complexity of the algorithm scales linearly with respect to the number of observations, which is 
beneficial for large datasets. The proposed approach can also be extended beyond standard activa- 
tion functions and NN model structures, for example, by including a hnear hidden unit or predefined 
interactions between the linear input-layer models. 
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In addition to generalizing the standard EP framework for sparse linear models we introduce an 
efficient EP approach for inference on the unknown hyperparameters, such as the noise level and 
the scale parameters of the weight priors. This framework enables flexible definition of different 
hierarchical priors, such as one common scale parameter for all input weights, or a separate scale 
parameter for all weights associated with one input variable (i.e., an integrated ARD prior). For 
example, assigning independent Laplace priors on the input weights with a common unknown scale 
parameter does not produce very sparse approximate posterior mean solutions, but, if required, more 
sparse solutions can be obtained by adjusting the common hyperprior of the scale parameters with 
the ARD specification. We show that by making independent approximations for the hyperparam- 
eters, the inference on them can be done simultaneously within the EP algorithm for the network 
weights, without the need for separate optimization steps which is common for many EP approaches 
for sparse linear models and GPs (Rasmussen and Williams, 2006; Seeger, 2008), as well as com- 
bined EKF and expectation maximization (EM) algorithms for NNs (de Freitas, 1999). Additional 
benefits are achieved by introducing left-truncated priors on the output weights which prevent pos- 
sible convergence problems in the EP algorithm resulting from inherent unidentifiability in the MLP 
network specification. 

The main contributions of the paper can be summarized as: 

• An efficiently scaUng EP approximation for the non-Gaussian likelihood resulting from the 
MLP-model that requires numerical approximations only for one-dimensional integrals. We 
derive closed-form solutions for the parameters of the site term approximations, which can 
be interpreted as pseudo-observations of a linear model associated with each hidden unit 
(Sections 3.1-3.3 and Appendices A-E). 

• An EP approach for integrating over the posterior uncertainty of the input weights and their 

hierarchical scale parameters assigned on predefined weight groups (Sections 3.1 and 3.2.1). 
The approach could be applied also for sparse linear models to construct factorized approxi- 
mations for predefined weight groups with shared hyperparameters. 

• Approximate integration over the posterior uncertainty of the observation noise simultane- 
ously within the EP algorithm for inference on the weights of a MLP-network (see Appendix 
D). Using factorizing approximations, the approach could be extended also for approximate 
inference on other hyperparameters associated with the likelihood terms and could be applied, 
for example, in recursive filtering. 

Key properties of the proposed model are first demonstrated with three artificial case studies 
in which comparisons are made with a neural network with infinitely many hidden units imple- 
mented as a GP regression model with a NN covariance function and an ARD prior (Williams, 
1998; Rasmussen and Williams, 2006). Finally, the predictive performance of the proposed ap- 
proach is assessed using four real-world data sets and comparisons are made with two alternative 
models with ARD priors: a GP with a NN covariance function where point estimates of the rel- 
evance hyperparameters are determined by optimizing their marginal posterior distribution, and a 
NN where approximate inference on all unknowns is done using MCMC (Neal, 1996). 
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2. The Model 

We focus on two layer NNs where the unknown function value fi = f{xi) related to a ^/-dimensional 
input vector x,- is modeled as 

K 

fi^i) = L v^§(WfcX/) + vo, (1) 

where g{x) is a nonlinear activation function, K the number of hidden units, and vo the output 
bias. Vector Wk = [wk,i,Wk,2,---,Wk,d]'^ contains the input layer weights related to hidden unit k and 
Vfe is the corresponding output layer weight. Input biases can be introduced by adding a constant 
term to the input vectors x^. In the sequel, all weights are denoted by vector z = [w^,v^]^, where 
w= [w|",...,w^]^, andv= [vi, ...,v^,vo]'^. 

In this work, we use the following activation function: 

where 4>(x) = fl^f^{t\0, l)dt, and the scaUng by 1/Vk ensures that the prior variance of /(x,) 
does not increase with K assuming fixed Gaussian priors on vt and Wkj- We focus on regression 
problems with Gaussian observation model p{yi\fi,0'^) = ^(yi\fi,0'^), where is the noise vari- 
ance. However, the proposed approach could be extended for other activation functions and obser- 
vations models, for example, the probit model for binary classification. 

2.1 Prior Definitions 

To reduce the effects of irrelevant features, independent zero-mean Laplace priors are assigned for 
the input layer weights: 




where wj is the j:th element of w, and Xij = 2"^/^exp((|);^./2) is a joint hyperparameter controlling 
the prior variance of all input weights belonging to group Ij G {1, that is, Var(wy|X./^.) = iXj,. 

Here index variable Ij defines the group in which the weight wj belongs to. The EP approximate 
inference is done using the transformed scale parameters (|)/ = log(2X^) G M. The grouping of 
the weights can be chosen freely and also other weight prior distributions can be used in place of 
the Laplace distribution (3). By defining a suitable prior on the unknown group scales (|); useful 
hierarchical priors can be implemented on the input layer. Possible definitions include one common 
scale parameter for all inputs (L = 1), and a separate scale parameter for all weights related to each 
feature, which implements an ARD prior {L = d). To obtain the traditional ARD setting the Laplace 
priors (3) can be replaced with Gaussian distributions p{wj\%ij) = ^SC{wj\0,exp{^ij)). When the 
scale parameters {(|)/}f=i are considered unknown, Gaussian hyperpriors are assigned to them: 

^i = \og{2Xj)r^^ifi^fl,olo), (4) 

where a common mean /^^ o and variance q have been defined for all groups Z = 1 , . . . , L. Definition 
(4) corresponds to a log-normal prior on the associated prior variance iXj = exp (([)/) for the weights 
in group 
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Because of the symmetry g{x) = —g{—x) of the activation function, changing the signs of out- 
put weight vjc and the corresponding input weights w^^ results in the same prediction /(x). This 
unidentifiability may cause converge problems in the EP algorithm: if the approximate posterior 
probability mass of output weight Vk concentrates too close to zero, expected values of and 
may start fluctuating between small positive and negative values. Therefore the output weights are 
constrained to positive values by assigning left-truncated heavy-tailed priors to them: 

p{n\<4,o) = ^tAn\0,olo), (5) 

where Vk>0,k=l,...,K, and ?v ( V/t 1 , q ) denotes a Student-? distribution with degrees of freedom 
V, mean zero, and scale parameter o^q. The prior scale is fixed to o^q = 1 and the degrees of freedom 
to V = 4, which by experiments was found to produce sufficiently large posterior variations of /(x) 
when the activation function is scaled according to (2) and the observations are normalized to zero 
mean and unit variance. The heavy-tailed prior (5) enables very large output weights if required, 
for example, when some hidden unit is forming almost a linear predictor (see, e.g. Section 4.2). 
A zero-mean Gaussian prior is assigned to the output bias: p{vo\ol^ q) = C\C{^,<^% o)' where the 
scale parameter is fixed to a^^ o ~ ^ because it was also found to be a sufficient simplification with 
the same data normalization conditions. The noise level is assumed unknown and therefore a 
log-normal prior is assigned to it: 

e = log(a2)-iA^(A/e,o,4o) (6) 

with prior mean fu^ o and variance Oq g. 

The values of the hyperparameters Xi = 2^^/^exp((|)//2) and o^q affect the smoothness proper- 
ties of the model in different ways. In the following discussion we first assume that there is only 
one input scale parameter Xi (L= I) for clarity. Choosing a smaller value for Xi penalizes more 
strongly for larger input weights and produces smoother functions with respect to changes in the 
input features. More precisely, in the two-layer NN model (1) the magnitudes of the input weights 
(or equivalently the ARD scale parameters) are related to the nonlinearity of the latent function 
/(x) with respect to the corresponding inputs x. Strong nonlinearities require large input weights 
whereas almost a linear function is obtained with a very large output weight and very small input 
weights for a certain hidden unit (see Section 4.2 for illustration). 

Because the values of the activation function g(x) are constrained to the interval [—1,1], hy- 
perparameter o^q controls the overall magnitude of the latent function /(x). Larger values of ct^q 
increase the magnitude of the steps the hidden unit activation Vkgiyvjx) can take in the direction of 
weight vector w^; in the feature space. Choosing a smaller value for a^Q can improve the predictive 
performance by constraining the overall flexibility of the model but too small value can prevent the 
model from explaining all the variance in the target variable y. In this work, we keep o^q fixed to 
a sufficiently large value and infer Xi from data promoting simultaneously smoother solutions with 
the prior on (|)/ = log(2A,^). If only one common scale parameter (|)/ is used, the sparsity-inducing 
properties of the prior depend on the shape of the joint distribution p{vf\\) = Ylj p{wj\Xi) resulting 
from the choice of the prior terms (3). By decreasing /Ui^^, we can favor smaller input weight values 
overall, and with o^q, we can adjust the thickness of the tails of ;7(w|A). On the other hand, if 
individual scale parameters are assigned for all inputs according to the ARD setting, a family of 
sparsity-promoting priors is obtained by adjusting /Hi^ q and a^Q. If /Ui^ q is set to a small value, say 
0.01, and oIq is increased, more sparse solutions are favored by allocating increasing amounts of 
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Figure 1: A directed graph representing the joint distribution of all the model parameters written 
in equation (7) resulting from the observation model and prior definitions summarized 
in Section 2. Observed variables indexed with / = 1, are denoted with boxes, unob- 
served random variables are denoted with circles, and fixed prior parameters are denoted 
with dots. For each input x,, / = 1,...,«, two intermediate random variables are visu- 
alized: The linear hidden unit activations defined as hi^k = w^x, and the latent function 
value given by = l^f^j Vkgijii^k) + vo- 



prior probability on the axes of the input weight space. A sparse prior could be introduced also on 
the output weights Vk to suppress redundant hidden units but this was not found necessary in the 
experiments because the proposed EP updates have fixed point at E(v^:) = and E(w<:) = and 
during the iterations unused hidden units are gradually driven towards zero (see Section 3.3.1). 



3. Approximate Inference 



In this section, we describe how approximate Bayesian inference on the unknown model parameters 
w, V, 6, and cf) = [(j)!, can be done efficiently using expectation propagation. First, the 

structure of the approximation and the expressions of the approximate site terms are presented in 
Section 3.1. A general description of the EP algorithm for determining the parameters of the site 
approximations is given in Section 3.2 and approximations for the non-Gaussian hierarchial weight 
priors are described in Section 3.2.1. The various computational blocks required in the EP algorithm 
are discussed first in Section 3.3 and detailed descriptions of the methods are given in Appendices 
A-E. Finally, an algorithm description with references to the different building blocks is given in 
3.3.1. 
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3.1 The Structure of the Approxunation 

Given a set of n observations 'D = {X,y}, where y = [yi,...,y„]^, X = [xi, ...,x„]^, the posterior 
distribution of all the unknowns is given by 

n Kd K L 

;7(z,e,0i®)=z"inp(3'«i^^e)n^Ki^)n^(^'tKo)n^(*'M9)' c^) 

(=1 7=1 k=0 1=1 

where cj^q ~ {^vO'^^vo ol aiidZ = /7(y|X,ajQ) is the marginal likelihood of the observed data condi- 
tioned on all fixed hyperparameters (in this case q). Figure 1 shows a directed graph representing 
the joint distribution (7) where we have also included intermediate random variables hi^k = wjx,- 
and fi related to each data point to clarify the upcoming description of the approximate inference 
scheme. To form an analytically tractable approximation, all non-Gaussian terms are approximated 
with unnormalized Gaussian site functions, which is a suitable approximating family for random 
vectors defined in the real vector space. 

We first consider different possibihties for approximating the hkeUhood terms p(j(|/i,9) which 
depend on the unknown noise parameter 6 = log and the unknown weight vectors w and v through 
the latent function value fi according to: 

/,- = vTg(xTw)=v^g(hO, (8) 



where x,- = \k® x, is a Kd x K auxiliary matrix formed as Kronecker product. It can be used to 
write all the linear input layer activations h, of observation x, as h,- = h(x;) = x^w. The vector 
valued function g(h,) applies the nonlinear transformation (2) on each component of h, according 
to g(h,) = [g(h/ i),g(h/2), •••,g(h,.A:), 1]^ (the last element corresponds to the output bias vq). If we 
approximate the posterior distribution of all the weights z = [w^,v^]^ with a multivariate Gaussian 
approximation that is independent of all the other unknowns including 0, the resulting EP algorithm 
requires fast evaluation of the means and covariances of tilted distributions of the form 

Pi{^) - p(y,-|v^g(h,-),e)5V:(z|iU„5],), (9) 



where Hj, is a known mean vector, and a known covariance matrix, and 6 is assumed fixed. 
Because the non-Gaussian likelihood term depends on z only through linear transformation hj, it 
can be shown (by differentiating (9) twice with respect to ^ij) that the normalization term, mean 
and covariance of pi{i) can be exactly determined by using the corresponding moments of the 
transformed lower dimensional random vector u; = B^z = [w^x,,v^]^, where the transformation 
matrix B, can be written as 

X, 
I^+i 



(10) 



This results in significant computational savings because the size of B,- is x d^, where we have 
denoted the dimensions of u,- and z with du = 2K+l and = Kd + K+1 respectively. It follows 
that the EP algorithm can be implemented by propagating the moments of u, using, for example, the 
general algorithm described by Cseke and Heskes (201 1, appendix C). The same principle has been 
utilized to form computationally efficient algorithms also for linear binary classification (Minka, 
2001b; Qi et al, 2004). 
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Independent Gaussian posterior approximations for both z and 9 can be obtained by approxi- 
mating the likelihood terms by a product of two unnormalized Gaussian site functions: 

p{yi\fhQ) ^ Zy^itzMkii.^)^ 

where Zy^i is a scalar scahng parameter. Because of the previously described property, the first 
likelihood site approximation depends on z only through transformation B-^z (Cseke and Heskes, 
2011): 

= exp ^z'BifiBjz + z^B^b,-^ , (11) 

where b; is a (i„ x 1 vector of location parameters, and T, a du x du site precision matrix. The second 
likelihood site term dependent on the scalar = loga^ is chosen to be an unnormalized Gaussian 

km = exp (-^^eje^ - f^Cme,i,oli), (12) 

where the site parameters JuQ^i and ^ control the location and the scale of site function, respec- 
tively. Combined with the known Gaussian prior term on 6 this results in a Gaussian posterior 
approximation for 6 that corresponds to a log-normal approximation for o^. 

The prior terms of the output weights Vk, for k = l,...,K, axe approximated with 

P{vk\<3lo) ~ Zv,ktv,k{vk) °^ ^{vk\Mv,k,^lk)^ (13) 

where Zyjt is a scalar scaling parameter, tv,k{vk) has a similar exponential form as (12), and the site 
parameters fiy^k and a^^ control the location and scale of the site, respectively. If the prior scales 
(|)/ are assumed unknown, the prior terms of the input weights {wj\j = l,...,Kd}, are approximated 
with 

P{'wMlj)^^wjtw,j{Wj)t^j{(\>lj) °- iSC{Wj\fiwJ,^ljW<\>lj\f^J,^lj), (14) 

where a factorized site approximation with location parameters fi^j and fii^j, and scale parameters 
G^j and j, is assumed for weight wj and the associated scale parameter (|)/^ , respectively. A 
similar exponential form to equation (12) is assumed for both twj{wj) and ti^j{(^ij). This factorizing 
site approximation results in independent posterior approximations for w and each component of 4> 
as will be described shortly. 

The actual numerical values of the normalization parameters Zyj, Zyjt, and Z^.j are not required 
during the iterations of the EP algorithm but their effect must be taken into account if one wishes 
to form an EP approximation for the marginal likelihood Z = p(y|X) (see Appendix G). This esti- 
mate could be utiUzed to compare models or to alternatively determine type-II MAP estimates for 
hyperparameters 6, {<|)/}f=i, and a^g they are not inferred within the EP framework. 

3.1.1 Fully-coupled approximation for the network weights 

Multiplying the site approximations together according to (7) and normalizing the resulting expres- 
sion gives the following Gaussian posterior approximation: 

L 

p{z,e,4>\'D,olo)^qiz)q{e)Ylq{<^l), (15) 

1=1 
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where ^(z) = fA^(z|/i,E), ^(9) = fA^(0|/^,a^), and q{(^i) = fA^ ((|)/ , J are the approximate pos- 
terior distributions of the weights z, the noise parameter 9 = loga^, and the input weight scale 
parameter (|)/, respectively. The mean vector and covariance matrix of ^(z), are given by 

^=(^^i%^J + ^o'^ and /i = 5]|^|;B,b; + 5]oVo^ , (16) 

where the parameters of the prior term approximations (13) and (14) are collected together in 5]o = 
diag([o^i,...,a^^^^,a2j,...,a2^]) and fxo = \Mw,u-;fiw,Kd,fiv,u-,iJv,KV- The parameters of ^(6) 
are given by 

= ^L^e,? + <^e,o^ and /le = a^,o (^^ej^^,i + <^ei^^,o^ ■ (17) 

The approximate mean and variance of q{^i) can be computed analogously to (17). The key property 
of the approximation (15) is that if we can incorporate the information provided by the data point 
ji in the parameters T, and b„ for all / = 1 , . . . , n, the approximate inference on the non-Gaussian 
priors p{vk) and p{wkj) is straightforward by adopting the methods described by (Seeger, 2008). In 
the following sections we will show how this can be done by approximating the joint distribution 
of /,, h; and v, given y_/ = [j] , , with a multivariate Gaussian and using it to 

estimate the parameters and b; one data point at a time within the EP framework. 

3.1.2 Factorizing approximation for the network weights 

A drawback with the approximation (16) is that the evaluation of the covariance matrix S scales as 
0{mm{Kd + K + which may not be feasible when the number of inputs d is large. Another 

difficulty arises in determining the mean and covariance of U; = B,z = [h^, v^]^ when z is distributed 
according to (9) because during the EP iterations has similar correlation structure with S. If 
the observation model is Gaussian and 6 is fixed, this requires at least X'-dimensional numerical 
quadratures (or other alternative approximations) that may quickly become infeasible as K increases. 
By adopting suitable independence assumptions between v and the input weights associated 
with the different hidden units, the mean and covariance of u, can be approximated using only 
1 -dimensional numerical quadratures as will be described in Section 3.3. 

The structure of the correlations in the approximation (16) can be studied by dividing T, into 
four blocks as follows: 

Thih; Th,y 



(18) 



where Th,h, is a K x K matrix, Th,y a K x K + I matrix, and T/,vv aA'-l-l x K + 1 matrix. The 
element [Thih;];t,;f contributes to the approximate posterior covariance between Wk and w^:', and the 
sub-matrix Th,v contributes to the approximate covariance between w and v. To form an alternative 
computationally more efficient approximation we propose a simpler structure for T,-. First, we ap- 
proximate Thjh, with a diagonal matrix, that is, Th,h, = diag(Ti), where only the A::th component of 
the vector r, contributes to the posterior covariance of Wk. Secondly, we set Th^v = and approx- 
imate T,- vv with an outer-product of the form T, vv = &iaj. With this precision structure the site 
approximation (1 1) can be factorized into terms depending only on the output weights v or the input 
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weights Wk associated with the different hidden units k=l,...,K: 



t^,i{z) = exp i (a[v)2 + ^^P (^~%k{xjv/kf + Vi,iwjx;^ (19) 

K 

=?v,i(v)nwK), 



where the site location parameters V; jt now correspond to the first K elements of b, in equation (11), 
that is, i>i = [v; 1, ...V; ;f]^ = [b, 1, ...,b, /f]^. Analogously, the site location vector /3; corresponds to 
the last K + 1 entries of b,-, that is, /3j = [hi^K+i b/,2^+1 

The factored site approximation (19) results in independent posterior approximations for the 
output weights v and the input weights w^t associated with different hidden units: 

K 

^(z)=^(v)n^(w^), (20) 

k=l 

where ^(v) = Cf^{fiv, "^v) and ^(w^;) = ^^C{|J'Wk^'^■wk)■ The approximate mean and covariance of Wk 
is given by 

Ew,= (xTt^,X + 5:^io) ' and fi^, = (x" i>^, + J^^l^fx^.^o) , (21) 

where the diagonal matrix T^^^ = diag(7v^) and the vector i>wj collect all the site parameters related 
to hidden unit k: fw^ = [^i,ki ■■■,^n,kV and i>^^ = [Vijt, ■.■,Vn,kV- The diagonal matrix and the 
vector /iwj^.o contain the prior site scales a^ ^ and the location variables related to the hidden 
unit k. The approximate mean and covariance of the output weights v are given by 

= (^L + ^yfij ai^d ^ly = A + ^ V ' A^v.o j , (22) 

where the diagonal matrix and the vector fiy Q contain the prior site scales d^j and the location 
variables fivj- 

For each hidden unit k, approximations (20) and (21) can be interpreted as independent linear 
models with Gaussian likelihood terms 9{^{yi^ic\xjwk,t^i^), where the pseudo-observations are given 
by yi,k = '^~ik^i,k- The approximation for the output weights (22) has no exphcit dependence on the 
input vectors x,- but later we will show that the independence assumption results in a similar depen- 
dence on expected values of g, taken with respect to approximate leave-one-out (LOO) distributions 
of w and v. 

3.2 Expectation Propagation 

The parameters of the approximate posterior distribution (15) are determined using the EP algo- 
rithm (Minka, 2001a). The EP algorithm updates the site parameters and the posterior approxima- 
tion q{i,Q,(f)) sequentially. In the following, we briefly describe the procedure for updating the 
likelihood sites 4 ; and fe , and assume that the prior sites (13) and (14) are kept fixed. Because the 
likelihood terms p{yi\fijQ) do not depend on and posterior approximation is factorized, that is 
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q(z,B,cl)) = q{z)q{Q)q{cj)), we need to consider only the approximations for z and 9. Furthermore, 
independent approximations for and v are obtained by using (19) and (20) in place ?z,, and q{z), 
respectively. 

At each iteration, first a proportion r\ of the j:th site term is removed from the posterior approx- 
imation to obtain a cavity distribution: 

q^iiz,d) = q^iiz)q^i{Q) oc ^(z)^(e)4,,•(z)-^^ei(e)"^ (23) 

where r\ G (0, 1] is a fraction parameter that can be adjusted to implement fractional (or power) EP 
updates (Minka, 2004, 2005). When rj = 1, the cavity distribution (23) can be thought of as a LOO 
posterior approximation where the contribution of the j:th Ukelihood term p{yi\fi,Q) is removed 
from q{z, 0). Then, the j:th site is replaced with the exact UkeUhood term to form a tilted distribution 

Piiz,e) =Zri^_,(z,e)p(y,|z,e,x)\ (24) 

where Z, is a normalization constant, which in this case can also be thought of as an approximation 
for the LOO predictive density of the excluded data point y, . The tilted distribution can be regarded 

as a more refined non-Gaussian approximation to the true posterior distribution. Next, the algo- 
rithm attempts to match the approximate posterior distribution ^(z,6) with Pi{z,Q) by finding first 
a Gaussian ^,(z, 9) that satisfies 

qi{z, 9) = argmin^. KL {pi{z, 9) \\qi{z, 9)) , 

where KL denotes the Kullback-Leibler divergence. When q{z, 9) is chosen to be a Gaussian dis- 
tribution this is equivalent to determining the mean vector and the covariance matrix of pi and 
matching them with the mean and covariance of qi. Then, the parameters of the j:th site terms are 
updated so that the moments of q{z, 9) match with q{z, 9): 

^,-(z,e) = 9(z,9) oc q^i{z)q^i{Q%,i{zy^hM'^. (25) 

Finally, the posterior approximation q{z, 9) is updated according to the changes in the site parame- 
ters. These steps are repeated for all sites in some suitable order until convergence. 

From now on, we refer to the previously described EP update scheme as sequential EP. If the 
update of the posterior approximation q{z, 9) in the last step is done only after new parameter values 
have been determined for all sites (in this case the n likelihood term approximations), we refer to 
parallel EP (see, e.g., van Gerven et al., 2009). Because in our case the approximating family 
is Gaussian and each UkeUhood term depends on a linear transformation of z, one sequential EP 
iteration requires (for each of the n sites) either one rank- (2^ + 1) covariance matrix update with the 
fully-correlated approximation (16), or ^ + 1 rank-one covariance matrix updates with the factorized 
approximation (21, 22). In parallel EP these updates are replaced with a single re-computation of 
the posterior representation after each sweep over all the n sites. In practice, one parallel iteration 
step over all the sites can be much faster compared to a sequential EP iteration, especially iid oxK 
are large, but parallel EP may require larger number of iterations for overall convergence. 

Setting the fraction parameter to r] = 1 corresponds to regular EP updates whereas choosing a 
smaller value produces a sUghtly different approximation that puts less emphasis on preserving all 
the nonzero probability mass of the tilted distributions (Minka, 2005). Consequently, setting x\<\ 
tries to represent possible multimodalities in (24) but ignores modes far away from the main proba- 
bility mass, which results in tendency to underestimate variances. However, in practice decreasing 
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r\ can improve the overall numerical stability of the algorithm and alleviate convergence problems 
resulting from possible multimodalities in case the unimodal approximation is not a good fit for the 
true posterior distribution (Minka, 2005; Seeger, 2008; Jylanki et al, 2011). 

There is no theoretical convergence guarantee for the standard EP algorithm but damping the site 
parameter updates can help to achieve convergence in harder problems (Minka and Lafferty, 2002; 
Heskes and Zoeter, 2002).' In damping, the site parameters are updated to a convex combination 
of the old values and the new values resulting from (25) defined by a damping factor 5 G (0, 1]. For 
example, the precision parameter of the Ukehhood site term f^^ ; is updated as x, ,t = (1 ~ ^)^tk ~^ 
6t"|"' and a similar update using the same 6- value is done on the corresponding location parameter 
Vj jt- The convergence problems are usually seen as oscillations over iterations in the site parameter 
values and they may occur, for example, if there are inaccuracies in the tilted moment evaluations, 
or if the approximate distribution is not a suitable proxy for the true posterior, for example, due to 
multimodaUties. 

3.2.1 EP APPROXIMATION FOR THE WEIGHT PRIOR TERMS 

Assuming fixed site parameters for the likelihood approximation (19), or (11) in the case of full 
couplings, the EP algorithm for determining the prior term approximations (13) and (14) can be 
implemented in the same way as with sparse linear models (Seeger, 2008). 

To derive EP updates for the non-Gaussian prior sites of the output weights v assuming the fac- 
torized approximation, we need to consider only the prior site approximations (13) and the approxi- 
mate posterior ^(v) = fA^(v|//v, Sv) defined in equation (22). All approximate posterior information 
from the observations (D = {y, X} and the priors on the input weights w are conveyed in the parame- 
ters {oti,$i}"^^ determined during the EP iterations for the likelihood sites. The EP implementation 
of Seeger (2008) can be readily applied by using ctiC^J and X!" 0i as a Gaussian pseudo likelihood 
as discussed in Appendix E. Because the prior terms /?(vyt|a^Q) depend only on one random variable 
Vk, deriving the parameters of the cavity distributions q-k{vk) °^ (l{vk)tv,k{vk\ftv,k,^l k) "^ updates 
for the site parameters Ju^^k and require only manipulating univariate Gaussians. The moments 
of the tilted distribution pk{vk) °= q-k{vk)p{vk\<^lo)^ can be computed either analytically or using 
a one-dimensional numerical quadrature depending on the functional form of the exact prior term 

To derive EP updates for the non-Gaussian hierarchical prior sites of the input weights w as- 
suming the factorized approximation (20), we can consider the approximate posterior distributions 
^(wyt) = 9^{wk\fJ'Wi,'^w,,) from equation (21) separately with the corresponding prior site approx- 
imations (14) related to the d components of w^;. All approximate posterior information from the 
observations y is conveyed by the site parameters {t„j.,i>wj:} that together with the input features 
form a Gaussian pseudo likelihood with a precision matrix X^diag(Tw^)X and location term X^i/^t 
(compare with equation 21). It follows that the EP implementation of Seeger (2008) can be applied 
to update the approximations q{v/k) but, in addition, we have to derive site updates also for the scale 
parameter approximations q{'^ij). EP algorithms for sparse linear models that operate on exact site 
terms depending on a nonlinear combination of multiple random variables have been proposed by 
Hemandez-Lobato et al. (2008) and van Gerven et al. (2009). 

1. Alternative provably convergent double-loop algorithms exist but usually they require more computational effort in 
the form of additional inner-loop iterations (Minka, 2001a; Heskes and Zoeter, 2002; Opper and Winther, 2005; 
Seeger and Nickisch, 2011). 
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Because the j:th exact prior term (3) depends on both the weight wj and the corresponding log- 
transformed scale parameter (|)/^ , the j:th cavity distribution is formed by removing a fraction r| of 
both site approximations t^^,.j{wj) and y((|)/^): 

<l-ji'WjAlj) = q-jiwj)q-j{<^lj) °< q{wj)q{(^ij)iw,jiwj)-\j{<^ij)-'^, (26) 

where q{wj) is the y:th marginal approximation extracted from the corresponding approximation 
qiwk), and the approximate posterior for is formed by combining the prior (4) with all the site 
terms ) that depend on (|)/,.: 

Kd 

The y:th tilted distribution is formed by replacing the removed site terms with a fraction x\ of the 
exact prior term p{wj\^ij): 

pMjA) =^wl^-j{'Wj)q-j{^lj)p{wj\^ij)'^ = q{wj,^ij), (27) 

where q{wj,i^ij) is a Gaussian approximation formed by determining the mean and covariance of 
Pj{wj,<^ij). The site parameters are updated so that the resulting posterior approximation is consis- 
tent with the marginal means and variances of q{wj,<^ij): 

4j{yyj)mij) = ?-,-K-)?-y(*/,)?wjK-)^f<^j(^,,.)^- (28) 

Because of the factorized approximation, the cavity computations (26) and the site updates (27) 
require only scalar operations similar to the evaluations of q-i{hi^k) and to the updates of {t/, in 
equations (31) and (46) respectively (see Appendix A and E). 

Determining the moments of (27) can be done efficiently using one-dimensional quadratures 
if the means and variances of wj with respect to the conditional distribution pj{wj\^ij) can be 
determined analytically. This can be readily done when p{wj\(Sfi.) is a Laplace distribution or a 
finite mixture of Gaussians. Note also that if we wish to implement an ARD prior we can choose 
simply p{wj\^i.) = 9\i{wj\0,^i.), where ^i- is a common scale parameter for all weights related to 
the same input feature, that is, weights {wj,Wj^d, ■■■,Wj^(^K-i)d}' for ^ach j G {1,2, ...,d}, share the 
same scale ^j. The marginal tilted distribution for (|)/^. is given by 

p{(^lj)=Z^I J ^_j(W;)^-X*;;)/^KI*/;)'^^^VV; =Z^/Z(^;.,ri)^_/^;.) 

where it is assumed that Z((|)/^,ri) = / q^j{wj)p{wj\^ij)^dwj can be calculated analytically. The 
normalization term Z^j^, the marginal mean j^^ij, and the variance d^^. can be determined using 
numerical quadratures. 

The marginal tilted mean and variance of wj can be determined by integrating numerically the 
conditional expectations of wj and wj over pj{^ij): 

E(w;)=2^/y" WjPj{wj\(^ij)Z{(^i.,r[)q-j{(^i.)dwjd(^i^ = J E{wj\(^i.,ii\)pj{(^ij)d(^ij 
Var(w,) = I E{wj\<^i.,^)pj{<^i.)d^i.-E{wj)\ (30) 
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where pj{wj\i^ij) = Z{i^ij,r\y^q^j{wj)p{wj\(^j^)^, and it is assumed that the conditional expecta- 
tions E(H'y|(|)/^ ,r|) and E(H'y|(|)/^ ,ri) can be calculated analytically. For prior distributions p{wj\<^ij) 
belonging to the exponential family, the exponentiation with r] results in a distribution of the same 
family multiplied by a function of r\ and . Evaluating the marginal moments according to equa- 
tions (29) and (30) requires a total of five one-dimensional quadrature integrations but in practice 
this can be done efficiently by utilizing the same function evaluations of Pj{'^ij) and taking into 
account the prior specific forms of E(w^|(|)/^,ri) andE(wj|(|)/j.,ri). 

3.3 Implementing the EP Algorithm 

In this section, we describe the computational implementation of the EP algorithm resulting from the 
choice of the approximating family described in Section 3.1. Because the non-Gaussian likelihood 
term in the tilted distribution (24) depends on z = [w^,v^]^ only through the linear transformation 
u,- = [hT,v^]^ = B^z, the EP algorithm can be more efficiently implemented by iteratively deter- 
mining and matching the moments of the lower-dimensional random vector u, instead of z (Cseke 
and Heskes, 201 1, appendix C). The computations can be further facilitated by using the factorized 
approximation (20): Because the hidden activations /i,jr, = x^w^; related to different hidden units 
k = I, ...,K axe independent of each other and v, it is only required to propagate the marginal means 
and covariances of ht^k and v to determine the new site parameters. This results in computation- 
ally more efficient determination of the cavity distributions (23), the tilted distributions (24), and 
the new site parameter from (25). Details of the computations required for updating the likelihood 
site approximations are presented in Appendices A-E. The main properties of the procedure can be 
summarized as: 

• Appendix A presents the formulas for computing the parameters of the cavity distributions 
(23). The factorized approximation (20) leads to efficient computations, because the cavity 
distribution can be factored as <7-,(z) = ^-,( v) nf=i (i-ii^k)- The parameters of q^i{hi^k) re- 
sulting from the transformation /z,- jt = x/^w^: can be computed using only scalar manipulations 
of the mean and covariance of q{hi^k) = IJ'Wk^'^'^vik^i)- Because of the outer-product 
structure of ?v,/(v) in equation (19), also the parameters of ^-;(v) can be computed using 
rank-one matrix updates. 

• Appendix B describes how the marginal mean and covariance of v with respect to the tilted 
distribution (24) can be approximated efficiently using a similar approach as in the UKF 
filter (Wan and van der Merwe, 2000). Because of the factorized approximation (20) only 
one-dimensional quadratures are required to compute the means and variances of g{hi^k) with 
respect to q-i{hi^k) and no multivariate quadrature or sigma-point approximations are needed. 

• Appendix C presents a new way to approximate the marginal distribution of Pi{hi_k) result- 
ing from (24). In preliminary simulations we found that a more simple approach based 
on the unscented transform and the approximate linear filtering paradigm did not capture 
well the information from the left-out observation >•,. This behavior was more problematic 
when there was a large discrepancy between the information provided by the likelihood term 
through the marginal tilted distribution Pi{yi\hi^k) = S p{yi\fi,^)'^q-i{y)q-i'^i-k)dyd\-k 
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and the cavity distribution q-i{hi^k), where \ii-k includes all components of h,- except hf^? 
The improved numerical approximation of Pi{hi^k) is obtained by approximating q-i{fi\hik), 
that is, the distribution of the latent function value /,■ = 'Y^k=iS{hi,k) + vo resulting from 
q-i{hi^-k,y\hi^k) = (l-i{'^)Y\k' =^k(l-i{hi-k)^ with a Gaussian distribution and carrying out the 
integration over fi analytically. According to the central Umit theorem we expect this approx- 
imation to get more accurate as K increases. 

• Appendix D generalizes the tilted moment estimations of Appendices B and C for approx- 
imate integration over the posterior uncertainty of 6 = loga^. Computationally convenient 
marginal mean and covariance estimates for v, {hi^k}f=\^ and 6 can be obtained by assuming 
an independent posterior approximation for 6 and making a similar Gaussian approximation 
for q-i{fi) as in Appendix C. Compared to the tilted moments approximations of v and h, 
with fixed 6, the approach requires five additional univariate quadratures for each UkeUhood 
term, which can be facilitated by utilizing the same function evaluations. 

• Appendix E presents expressions for the new site parameters obtained by applying the results 
of Appendices A-D in the moment matching condition (25). Because of the factorization 
assumption in (20) and the UKF-style approximation in the tilted moment estimations (Ap- 
pendix B), the parameters of the likelihood term approximations related to v can be written 
as &i = mg,a7; and /3,- = m^.a^Jfiyj, where [mg,]*. = f g{hk)q-i{hk)dha and fjyj can be 
interpreted as Gaussian pseudo-observations with noise variances (compare with equa- 
tion (42) and (43)). By comparing the parameter expressions with (22), the output-layer 
approximation ^(v) can be interpreted as a linear model where the cavity expectations of the 
hidden unit outputs g{hi k) = g{v/Jxi) are used as input features. The EP updates for the site 
parameters T,,yt and ^i,k related to the input weight approximations q{wk) require only scalar 
operations similarly to other standard EP implementations (Minka, 2001b; Rasmussen and 
Williams, 2006). 

Appendix F describes how the predictive distribution p{y*\x^,) related to a new input vector x* 
can be approximated efficiently using ^(v), {?(wyt)}^j, and ^(0). Note that the prior scale approx- 
imations {q{<^i)}f^i are not needed in the predictions because information from the hierarchical 
input weight priors is approximately absorbed in {^(wyt)}f^j during the EP iterations. Appendix G 
shows how the EP marginal likelihood approximation, logZgp ~ logp(y|X), conditioned on fixed 
hyperparameters (in this case q), can be computed in a numerically efficient and stable man- 
ner. The marginal likelihood estimate can be used to monitor convergence of the EP iterations, to 
determine marginal MAP estimates of the fixed hyperparameters, and to compare different model 
structures. 

3.3.1 General algorithm description and other practical considerations 

Algorithm 1 collects together all the computational components described in Section 3.2.1 and 
Appendices A-E into a single EP algorithm. In this section we will discuss the initialization, the 
order of updates between the different term approximations, and the convergence properties of the 
algorithm. 

2. The UKF approach was found to perform better with smaller values r| because then a fraction of the site approx- 
imation from the previous iteration is left in the cavity, which can reduce the possible multimodality of the tilted 
distribution. 
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Algorithm 1: An EP algorithm for a two-layer MLP-network with non-Gaussian hierarchical 
priors on the weights. 

Initialize approximations /ly, Sy, {/iwt,5]wjf=i, Afe, CJq, and {A'(|),,<7|,}f=i using X, 

{fi,i>i,ai,0i,fk,i,cil i}'^^i, {fiwj,cilj,i^j,clj}f£i, and {iiv,k,^li,}f^i. 

repeat 

if sufficient convergence in {fi,i>i,&i,0i,/UQ^i,dli}1^i and {iiv,k,olk}k=i then 

Run the EP algorithm to update the parameters {fiw,j,^w,j^f^,ji^^j}f=i °f the prior 
site approximations (14) associated with the input weights w and the scale 
parameters (p (Section 3.2.1). 

end 

Loop over the non-Gaussian UkeUhood terms: 
for / ^ 1 to « do 

Compute the means and covariances of the cavity distributions: {(l-i{hi,k)}k=i ^^'^ 
q-i{\) using equations (31) and (32). 

If 6 unknown, compute the cavity distribution ^_/(6) (Appendix A). 
Compute the means and covariances of the tilted distributions qi(y) = !AC{p'i,'^i) and 
qi{hj) = ^{m,Vi) for k=l,...,K: 
If 6 known, use (33) and (36). 

Otherwise, use (38), (39), and (41), and compute ^,(6) = from (37). 

Update the site parameters r,-, i>i, a,-, /3,- using (44), (45), and (46). 
If 6 unknown, update fiQj, Oq ■. 
if sequential updates then 

Rank-1 updates for {q{y/k)}k=i according to the changes in {'ii,k,^i,k}k=i- 
If 6 unknown, update the mean and covariance of ^(6). 
end 

end 

if parallel updates then 

Recompute the posterior approximations {q{y/k)}k=i using {t,,i>,}"^j. 
If 6 unknown, recompute the mean and covariance of ^(0). 

end 

Update ^(v) using {ct,, Alf^j. 
if sufficient data fit then 

Run the EP algorithm to update the parameters {fiv,k^^^](}k=\ the prior site 
approximations (13) related to the output weights v (Section 3.2.1). 

end 

until convergence or maximum number of iterations exceeded 



In practice, we combined the EP algorithms for the likelihood sites (19) and the weight prior 
sites of V (13) and w (14) by running them in turn (in respective order, see lines 2-7, 8, and 1 in 
Algorithm 1). Because all information from the observations y is conveyed by the likelihood term 
approximations, it is sensible to iterate first the parameters r,, i>,, and /3, to obtain a good data fit 
while keeping the prior term approximations (13) and (14) fixed so that all the output weights remain 
effectively positive and all the input weights have equal prior distributions. Otherwise, depending 
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on the scales of the priors, many hidden units and input weights could be effectively pruned out of 
the model by the prior site parameters {fjv,k,^lk}f=[ and {jii„j,a^. j,jii,^j,G^ j^J=y '- ^'^^ example, the 
prior means ju^j would push the approximate marginal distribution q{wj) towards zero and the scale 
parameter a^y would adjust the variance of q(wj) to the level reflecting the fixed scale prior defi- 
nition p{^ij) = fA^(jU(|),o,<5^o)- During the iterations, the data fit can be assessed by monitoring the 
convergence of the approximate LOO predictive density logZLoo = lli^ogp{yi\y^i,X) « Lflog^j 
that usually increases steadily in the beginning of the learning process as the model adapts to the ob- 
servations y. In contrast, the approximate marginal likelihood logZEp « logp(y|X) depends more 
on the model complexity and usually fluctuates more during the learning process because many 
different model structures can produce the same predictions. 

We initialized the algorithm with 10-20 iterations over the likelihood sites with 9 fixed to a 
sufficiently small value, such as = exp(0) = 0.3^, and the input weight priors set to fi^.j = 
and dl,j = 0.5, where we have assumed that the target variables y and the columns of X containing 
the input variables are normalized to zero mean and unit variance. For the input bias term (the 
last column of X), a larger variance = 2^ can be used so that the network is able to produce 
step functions at different locations of the input space. The prior means of the output weights 
fiv^k were initialized with linear spacing in some appropriate interval, for example [1,2], and the 
prior variances were set to sufficiently small values such as G^^. = 0.2^ so that the elements of the 
approximate mean vector /Xy remain positive during the initial iterations. 

We initialized the parameters {r,, i>;, q:,,;9;}"^j to zero, which means that in the beginning all 
hidden units produce zero expected outputs mg_ = resulting into zero messages to the output 
weights V in equations (42) and (43). However, because of the initialization of juy^k and a^^, the initial 
approximate means of the output weights [iJ,y]k = Mv,k will be positive and nonidentical. It follows 
that different nonzero messages will be sent to the input weights according to (46) because the tilted 
moments m,j: and Vi^t from (36) will differ from the corresponding marginal moments = xT/x^;; 
and Vi^k = xJ^Sw^Xf- If in the beginning all the hidden units were updated simultaneously with the 
same priors for the output weights, they would get very similar approximate posteriors. In this 
case all the computational units do more or less the same thing but sufficiently many iterations 
would eventually result in different values for all the input weight approximations q{v/k)- This 
learning process can be accelerated by the previously described linearly spaced prior means juyj or 
by updating only one hidden unit in the beginning and increasing the number of updated units one 
by one after every few iterations. The rationale behind the latter incremental scheme is that the first 
unit will usually explain the dominant hnear relationships between x and y and the remaining units 
will fit to more local nonlinearities. 

The positive Gaussian output weight priors defined at the initialization of juy^k and can be 
relaxed after the initial iterations by running the EP algorithm on the term approximations (13) for 
the truncated prior terms (5) (line 8 in Algorithm 1). The EP updates for the observation noise 9 can 
be started after the initial iterations (lines 2-5 in Algorithm 1). We initialized the site parameters 
{/ie,(,^0 - j^^j to zero, and at the first iteration for 9 we also kept parameters r,, and 0i 

fixed so that the initial fluctuations of /ie,/ and Og • do not affect the approximations ^(v) and q{vfic). 
After sufficient convergence is obtained in the EP iterations on the parameters of the likelihood sites 
{fi,i>i,a.i,0i,fiej,al J"^; and the parameters of the output weight prior sites {fiv,k,^lk}f=i, the EP 

updates can be started on the parameters {fiwj,^w,jif^Ji^'^ j^f=i P'^^^'^ approximations 

(14) (line 1 in Algorithm 1). 
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If all the prior term approximations together with {r, , are kept fixed, that is, q{wk) are not 

updated, the EP algorithm for the parameters a,- and /3/ related to q(\) converges typically in 5-10 
iterations. In addition, if tv^^ and i>y,^ related to only one hidden unit k are updated, the algorithm 
will typically converge in less than 10 iterations. The fast convergence is expected because in both 
cases the iterations can be interpreted as a standard EP algorithm on a linear model with known input 
variables. However, updating only one hidden unit at a time will induce moment inconsistencies 
between the approximations and the corresponding tilted distributions of the other K—l hidden unit 
activations hi^k and the output weights v. This means that such update scheme would require many 
separate EP runs for each hidden unit and v to achieve overall convergence and in practice it was 
found more efficient to update all of them together simultaneously with a sufficient level of damping. 
The updates on d,- and /3,- were damped more strongly by 8 G 0.2 so that subsequent changes in ^(v) 
would not inflict unnecessary fluctuations in the parameters of ^(w^;), which are more difficult to 
determine and converge more slowly compared with ^(v). In other words, we wanted to change the 
output weight approximations more slowly so that messages have enough time to propagate between 
the hidden units. For the same reason, on the hne 7 of Algorithm 1 parallel updates are done on 
^(v) whereas the user can choose between sequential and parallel updates for q(\Vk) (fines 5 and 6). 
With sequential posterior updates for ^(w^), damping the updates of r,- and i>/ with 6 G [0.5, 0.8] was 
found sufficient whereas with parallel updates 5 < 0.5 was often required. If there are large number 
of input features, it may be more efficient to use parallel updates for q{vfk) with larger amount of 
damping in a similar framework as described by van Gerven et al. (2009). 

The EP updates for the prior terms of v and w^; are computationally less expensive and converge 
faster compared with the likelihood term approximations. With fixed values of {f;-,i>;,Q:,,^/}"^j 
typically 5-10 iterations were required for convergence of the updates on the prior term approxima- 
tions related to v in line 8 of Algorithm 1. The relative time required for computations is negligible 
compared with lines 2-7 because the output weights are allowed to change relatively slowly by 
damping the updates on d, and /3, in line 4. For this reason we ran the EP algorithm for the prior 
term approximations of v to convergence after each parallel update of ^(v) on fine 7 to make sure 
that components of v are distributed at positive values at all times. Because of the propagation of in- 
formation between approximations ^(w^) via the hierarchial scale parameter approximations ^((|)/), 
larger number of iterations (typically 10-40) were required for convergence of the updates on the 
hierarchical prior term approximations related to w in line 1 of Algorithm 1. At least two sensible 
update schemes can be considered for EP on the input weight priors after sufficient convergence is 
achieved with the initial Gaussian priors defined using ju^^j and d^^: 1) The EP algorithm in line 
1 is run only once until convergence and then the other parameters {f;,i>,,Q:,,/3;,/ie,/,6Q ;}"^; and 
{fiv,k,^lk}k=i are iterated to convergence with fixed {j-iw,j,^wj}f£i or 2) the EP algorithm in fine 1 is 
run once until convergence and after that only one inner iteration is done on {fiy^,j,ci'^ j,fii^j,a'^ j}f=i 
in line 1. In the first scheme a fixed sparsity-favoring Gaussian prior is constructed using the cur- 
rent likelihood term approximations and in the latter scheme the prior is iterated further within the 
EP algorithm for the fikefihood terms. The latter scheme usually converges more slowly and re- 
quires more damping. Damping the updates by 5 G [0.5,0.7] and choosing a fraction parameter 
r\ G [0.7,0.9] resulted in numerically stable updates and convergence for the EP algorithms on the 
prior term approximations. 

The fraction parameter r| used in updating the likelihood term approximations according to 
equations (23)-(25) has a significant effect on the behavior of the algorithm. Because the approx- 
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imate tilted distributions (36) and (41) are often multimodal when the prediction resulting from 
the cavity distributions q-ii\) and q^i{hi) does not fit well the left out observation j,, the value 
of ri affects significantly the Gaussian approximation q{hi^k) = ^{hi,k)\^i,k^^i,k)- When r) is close 
to one and the discrepancy between and the cavity prediction is large, the resulting multimodal 
tilted distribution is represented with a very wide Gaussian distribution. If there are no other data 
points supporting the deviating information provided by y,, the model may simply attempt to widen 
the predictive distribution at x,. Consequently, the updates on the sites with large discrepancies are 
often more difficult because of large changes to r, and £>, . Furthermore, the approximation may not 
fit well the training data if there are isolated data points that caimot be considered as outliers. If 
r\ is smaller, for example r\ G [0.4,0.7], a fraction 1 — r| of the site approximation ?WA,/(wi:lr,, i>,) 
from the previous iteration is left in the cavity distribution and the discrepancy between the cavity 
prediction and y,- is usually small. Consequently, the model fits more accurately to the training data, 
the EP updates are numerically more robust, and usually less damping is required. However, in the 
experiments we found that with smaller values of x\ the model can also overfit more easily which is 
why we set t] = 0.95. 

4. Experiments 

First, three case studies with simulated data were carried out to illustrate the properties of the pro- 
posed EP-based neural network approach with sparse priors (NN-EP). Case 1 compares the effects 
of integration over the uncertainty resulting from a sparsity-favoring prior with a point-estimate 
based ARD solution. Case 2 illustrates the benefits of sparse ARD priors on regularizing the pro- 
posed NN-EP solution in the presence of irrelevant features and various input effects with different 
degree of nonlinearity. Case 3 compares the parametric NN-EP solution to an infinite Gaussian 
process network using observations from a discontinuous latent function. In cases 1 and 3, compar- 
isons are made with an infinite network (GP-ARD) implemented using a Gaussian process with a 
neural network covariance function and ARD-priors with separate variance parameters for all input 
weights (Williams, 1998; Rasmussen and Williams, 2006). The neural-network covariance func- 
tion for the GP-prior can be derived by letting the number of hidden units approach infinity in a 
2-layer MLP network that has cumulative Gaussian activation functions and fixed zero-mean Gaus- 
sian priors with separate variance (ARD) parameters on the input-layer weights related to each input 
variable (Williams, 1998). Point estimates for the ARD parameters, the variance parameter of the 
output weights, and the noise variance were determined by optimizing the marginal likelihood with 
uniform priors on the log-scale. Finally, the predictive accuracy of NN-EP is assessed with four 
real-world data sets and comparisons are made with a neural network GP with a single variance 
parameter for all input features (GP), a GP with ARD priors (GP-ARD), and a neural network with 
hierarchical ARD priors (NN-MC) inferred using MCMC as described by Neal (1996). 

4.1 Case 1: Overfitting of the ARD 

The first case illustrates the overfitting of ARD with a similar example as presented by Qi et al. 
(2004). Figure 2 shows a two-dimensional regression problem with two relevant inputs xi and X2. 

The data points are obtained from three clusters, {/(x) = l|xi > 0.5, X2 > 0.5}, {/(x) = 0|0.5 > 
x\ > —0.5,0.5 >X2> —0.5}, and {/(x) =0.8|xi < — 0.5,X2 < —0.5}. The noisy observations were 
generated according toy = /(x) + 8, where 8 ~ iV^(0,0.1^). The observations can be explained by 
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Figure 2: Case 1 : An example of the overfitting of the point-estimate based ARD on a simulated 
data set with two relevant input features, (a) A GP model with a neural network covari- 
ance function and point-estimates for the ARD parameters, (b) An EP approximation for 
a neural network with 10 hidden units and independent Laplace priors with one common 
unknown scale parameter ^ on the input weights, (c) and (d) The 95 % approximate 
marginal posterior probability intervals for the input weights and the output weights of 
the EP-based neural network. 



using a combination of two step functions with only either one of the input features but a more 
robust model can be obtained by using both of them. 

Subfigure (a) shows the predictive mean of the latent function /(x) obtained with the optimized 
GP-ARD solution. Input X2 is effectively pruned out and almost a step function is obtained with 
respect to input xi. Subfigure (b) shows the NN-EP solution with K =10 hidden units and Laplace 
priors with one common unknown scale parameter (|)i on the input weights w. The prior for (|)i 
was defined as ~ lA^(/j([,.(),a^ q), where /n^x) = 21og(0.1) and ct^q ~ l-^^- "^^^ noise variance 
was inferred using the same prior definition for both models: 6 = log(a^) ~ i'\^(/7e o,ag q), where 
fiQfi = 21og(0.05) and cTq q = 1.5^. NN-EP produces a much smoother step function that uses both 
of the input features. Despite of the sparsity favoring Laplace prior, the NN-EP solution preserves 
the uncertainty on the input variable relevances. This shows that the approximate integration over 
the weight prior can help to avoid pruning out potentially relevant inputs. Subfigure (c) shows the 
95% approximate marginal posterior probability intervals derived from the Gaussian approxima- 
tions q{wk) with the same ordering of the weights as in vector = [wj, w^] (every third weight 
corresponds to the input bias term). The vertical dotted lines separate the input weights associ- 
ated with the different hidden units. Subfigure (d) shows the same marginal posterior intervals for 
the output weights computed using ^(v). Only hidden units 5 and 6 have clearly nonzero output 
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weights and input weights corresponding to the input variables xi and X2 (see the first two weight 
distributions in triplets 5 and 6 in panel (c)). For the other hidden units, the input weights related 
to xi and X2 are distributed around zero and they have neghgible effect on the predictions. In panel 
(c), the third input weight distribution corresponding to the bias term in each triplet are distributed 
in nonzero values for many unused hidden units but these bias effects affect only the mean level 
of the predictions. These nonzero bias weight values may be caused by the observations not being 
normahzed to zero mean. The weights corresponding to hidden unit 1 differ from the other unused 
units, because a linear action function was assigned to it for illustration purposes. If required, a 
truly sparse model could be obtained by removing the unused hidden units and running additional 
EP iterations until convergence. 

4.2 Case 2: The Effect of Sparse Priors in a Regression Problem Consisting of Additive 
Input Effects with Different Degree of Nonlinearity 

The second case study illustrates the effects of sparse priors using a similar regression example as 
considered by Lampinen and Vehtari (2001). In our experiments we found two main effects from ap- 
plying sparsity-promoting priors with adaptive scale parameters = on the input-layer. 
Firstly, the sparse priors can help to suppress the effects of irrelevant features and protect from 
overfitting effects in input variable relevance determination as illustrated in Case 1 (Section 4.1). 
Secondly, sparsity-promoting priors with adaptive prior scale parameters <^ can mitigate the effects 
of unsuitable initial Gaussian prior definitions on the input layer (too large or too small initial prior 
variances a^^j, see Section 3.3.1 for discussion on the initialization). More precisely, the sparse pri- 
ors with adaptive scale parameters can help to obtain better data fit and more accurate predictions by 
shrinking the uncertainty on the weights related to irrelevant features towards zero and by allowing 
the relevant input weights to gain larger values which are needed in modeling strongly nonlinear (or 
step) functions. Placing very large initial prior variances o-^j on all weights enables the model to fit 
strong nonlinearities but the initial learning phase is more challenging and prone to end up in poor 
local minima. In this section, we demonstrate that switching to Gaussian ARB priors with adaptive 
scale parameter after a converged EP solution is obtained with fixed Gaussian priors can 
reduce the effects of irrelevant features, decrease the posterior uncertainties on the predictions on 
/(x), and enable the model to fit more accurately latent nonlinear effects. 

A data set with 200 observations and ten input variables with different additive effects on the 
target variable was simulated. The black lines in Figure 3 show the additive effects as a function 
of each input variable xij. The targets were calculated by summing the additive effects together 
and adding Gaussian noise with a standard deviation of 0.2. The first input variable is irrelevant and 
variables 2-5 have an increasing linear effect on the target. The effects of input variables 6-10 are 
increasingly nonlinear and the last three of them require at least three hidden units for explaining 
the observations. 

Figure 3(a) shows the converged NN-EP solution with fixed zero-mean Gaussian priors on the 
input weights. The number of hidden units was set to A' = 10 and the noise variance was inferred 
using the prior definition /^e,o = 21og(0.05) and cJqq = 2-^. The Gaussian priors were defined by 
initializing the prior site parameters of the input weights as {ji^j = 0, c^j = 0.4^} . The dark grey 
fines illustrate the posterior mean predictions and the shaded light gray area the 95% approximate 
posterior predictive intervals of the latent function /(x). The graphs are obtained by changing 
the value of each input in turn from —5 to 5 while keeping the others fixed at zero. The training 
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Figure 3: Case 2: An artificial regression problem where the observations are formed as a sum of 
additive input effects dependent on ten input features. The true additive effects are shown 
with black lines and the estimated mean predictions with dark grey lines. The 95% pos- 
terior predictive intervals are shaded with light grey, (a) A converged EP approximation 
for a neural network with ten hidden units and fixed zero-mean Gaussian priors on the 
input weights, (b) The resulting EP approximation when the Gaussian priors of the net- 
work in panel (a) are replaced with Gaussian ARD priors with separate scale parameters 
(|)i, ...,<|)rf for all input variables, and additional EP iterations are done until a new con- 
verged solution is obtained. Figure 4 visualizes the approximate posterior distributions of 
the parameters of the ARD network from panel (b). 



observations are obtained by sampling all input variables linearly from the interval Xi^j G [— 
Panel (b) shows the resulting NN-EP solution when the Gaussian priors of the network in panel (a) 
are replaced with Gaussian ARD priors with adaptive scale parameter (])i, ...,(])^ and additional EP 
iterations are done until convergence. Prior distributions for the scale parameters were defined as 
~ -^(/^(|),0) o)' where mff^ — 21og(0.01) and cj^q = 2.5^. This prior definitions favors small 
input variances close to 0.01 but enables also larger values around one. It should be noted that the 
actual variance parameters y of the prior site approximations can attain much larger values from 
the EP updates. 

With the Gaussian priors (Figure 3(a)), the predictions do not capture the nonlinear effects very 
accurately and the model produces a small nonzero effect on the irrelevant input 1 . Applying the 
ARD priors (Figure 3(b)) with additional iterations produces clearly more accurate predictions on 
the latent input effects and effectively removes the predictive effect of input 1. The overall approx- 
imate posterior uncertainties on the latent function /(x) are also smaller compared with the initial 
Gaussian priors. We should note that the result of panel (a) depends on the initial Gaussian prior 
definitions and choosing a smaller a^^ = 0.2^ or optimizing it could give more accurate predictions 
compared with the solution visualized in panel (a). 

Figure 4 shows the 95% posterior credible intervals for the input weights w (a), the prior scale 
parameters (|) i , . . . , (b), and the output weights v (c) of the NN-EP approximation with ARD priors 
visualized in Figure 3(b). In panel (a) the input weights from the different hidden units are grouped 
together according to the different additive input effects 1-10, and the weights related to the linear 
effects 1-5 are scaled by 40 for illustration purposes, because they are much smaller compared with 
the weights associated with the nonlinear input effects 6-10. From panels (a) and (c) we see that 
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Figure 4: Case 2: Visualization of the model parameters related to the artificial regression problem 
shown in Figure 3. Panels (a), (b), and (c) show the 95% marginal posterior credible 
intervals for the input weights w, the scale parameters ^i,...,(^d, and the output weights v 
of the neural network with Gaussian ARD priors from Figure 3(b). In panel (a) the input 
weights associated with each additive input effect (1-10) are grouped together (the bias 
terms are not shown). The weight distributions related to the linear input effects 1-5 are 
much smaller compared with the nonlinear effects 6-10, which is why they are scaled by 
40 for better illustration in panel (a). 



only hidden units are 1-5 and 9 have clearly non-zero effect on the predictions. The linear effects 
of inputs 1-5 are modeled by unit 3 that has very small but clearly nonzero input weights in panel 
(a) and a very large output weight in panel (a). The input weights related to the irrelevant input 1 
are all zero in the 95% posterior confidence level. By comparing panels (a) and (c) we can also 
see that hidden units 1, 2, 4, 5, and 9 are most probably responsible for modeUng the nonhnear 
input effects 6-7 because of large input weights values. Panel (b) gives further evidence on this 
interpretation because the scale parameters associated with the nonlinear input effects 6-10 are 
clearly larger compared to effects 1-5. The scale parameters associated with the linear input effects 
1-5 increase steadily as the magnitudes of the true effects increase. These results are congruent 
with the findings of Lampinen and Vehtari (2001) who showed by MCMC experiments that with 
MLP models the magnitudes of the ARD parameters and the associated input weights also reflect 
the degree of nonlinearity associated with the latent input effects, not only the relevance of the input 
features. 

4.3 Case 3: Comparison of a Finite vs. Infinite Network with Observations from a Latent 
Function with a Discontinuity 

The third case study compares the performance of the finite NN-EP network with an infinite GP 
network in a one-dimensional regression problem with a strong discontinuity. Figure 5 shows the 
true underlying function (black fines) that has a discontinuity at zero together with the noisy ob- 
servations (black dots). Panel (a) shows the predictive distributions obtained using NN-EP with ten 
hidden units {K = 10) and Laplace priors with one common scale parameter (|). The prior distribution 
for the scale parameter was defined with ju^fl = 21og(0.01) and CJ^ q = 2.5^, and the noise variance 
cs^ was inferred from the data using the prior definition |ie,o = 21og(0.05) and Og q = 2^. Panel (b) 
shows the corresponding predictions obtained using a GP with a neural network covariance func- 
tion. With the GP network the noise variance was optimized together with the other hyperparameters 
using the marginal likelihood. The finite NN-EP network explains the discontinuity with a slightly 
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Figure 5: Case 3: An artificial regression problem consisting of noisy observations (black dots) 
generated from a latent function (black lines) that has a discontinuity at zero. Panel (a) 
shows the mean predictions (dark grey line) and the 95% credible intervals (light gray 
shaded area) obtained using the proposed EP approach for a NN with ten hidden units 
and Laplace priors with one common scale parameter (]) on the input weights. Panel (b) 
visualizes the corresponding predictive distribution obtained using a GP with a neural 
network covariance function. 



smoother step compared to infinite GP network, but the GP mean estimate shows fluctuations near 
the discontinuity. It seems that the infinite GP network fits more strongly to individual observations 
near the discontinuity. This shows that a flexible parametric model with a limited complexity may 
generalize better with finite amount of observations even though the GP model includes the correct 
solution a priori. This is in accordance with the results described by Winther (2001). 

4.4 Predictive Comparisons with Real World Data 

In this section the predictive performance of NN-EP is compared to three other nonlinear regression 
methods using the following real-world data sets: the concrete quality data (Concrete) analyzed 
by Lampinen and Vehtari (2001), the Boston housing data (Housing) and the unnormahzed Com- 
munities and Crime data (Crime) that can be obtained from the UCI data repository (Frank and 
Asuncion, 2010), and the robot arm data (Kin40k) utilized by Schwaighofer and Tresp (2003).^ 
The number of observations n and the number of input features d are shown in Table 1 for each 
data set. The Kin40k includes originally only 8 input features but we added 92 irrelevant uniformly 
sampled random inputs to create a challenging feature selection problem. The columns of the input 
matrices X and the output vectors y were normalized to zero mean and unit variance for all methods. 
The predictive performance of the models was measured using the log predictive densities and the 
squared errors evaluated with separate test data. We used 10-fold cross-validation with the Housing, 
Concrete, and Crime data, whereas with Kin40k we chose randomly 5000 data points for training 
and used the remaining observations for validation. 



3. Kin40k data is based on the same simulation of the forward kinematics of an 8 hnk all-revolute robot arm as the 
Kin family of data sets available at http : / /www . cs . toronto . edu/ ~ delve/ except for lower noise level and larger 
amount of observations. 
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The proposed NN-EP solution was computed using two alternative prior definitions: Laplace 
priors with one common scale parameter (|) (NN-EP-LA), and Gaussian ARD priors with separate 
scale parameters for all inputs including the input bias terms (NN-EP-ARD). With both 

prior frameworks, the hyperpriors for the scale parameters were defined as (|)/ ~ 0\C{fii^,o,(^lo), where 
l^fi = 21og(0.01) and 0^0 = '2-5^ ■ This definition encourages small input weight variances (around 
0.01^) but enables also large input weight values if required for strong nonlinearities assuming the 
input variables are scaled to unit variance. The noise level 6 = log(a^) was inferred from data with 
a prior distribution defined by = 21og(0.01) and CJg q = 2^, which is a sufficiently flexible prior 
when the output variables y are scaled to unit variance. The methods used for comparison include 
an MCMC -based MLP network with ARD priors (NN-MC) and two GPs with a neural network 
covariance function: one with common variance parameter for all inputs (GP), and another with 
separate variance hyperparameters for all inputs (GP-ARD). With both GP models the hyperparam- 
eters were estimated by gradient-based optimization of the analytically tractable marginal likelihood 
(Rasmussen and Williams, 2006). For NN-MC and NN-EP, we set the number of hidden units to 
^ = 10 with the Housing, Concrete, and Crime data sets. With the Kin40k data, we set = 30 
because n is large and fewer units were found to produce clearly worse data fits. 

Table 1 summarizes the means (mean) and standard deviations (std) of the log predictive densi- 
ties (LPDs) and the squared errors (SEs). Because the distributions of the LPD values are heavily 
skewed towards negative values, we summarize also the lower 1% percentiles (prct 1%). Similarly, 
because the SE values are skewed towards positive values we summarize also the 99% percentiles 
(prct 99%). These additional measures describe the quality of the worst case predictions of the 
methods. Table 1 summarizes also the average relative CPU times (cputime) required for parameter 
estimation and predictions using MATLAB implementations. The GP models were implemented 
using the GPstuff* toolbox and NN-MC was implemented using the MCMCstuff^ toolbox. The 
CPU times were averaged over the CV-folds and scaled so that the relative cost for NN-EP is one. 
These running time measures are highly dependent on the implementation, the tolerance levels in 
optimization and iterative algorithms, and the number of posterior draws, and therefore they are 
reported only to summarize the main properties regarding the scalability of the different methods. 
When assessing the results with respect to the Housing and Concrete data sets, it is worth noting 
that there is evidence that an outlier-robust observation model is beneficial over the Gaussian model 
used in this comparison with both data sets (Jylanki et al., 201 1). 

Table 1 shows that NN-EP-LA performs slightly better compared to NN-EP-ARD in all data 
sets except in Kin40k, where NN-EP-ARD gives clearly better results. The main reason for this 
is probably the stronger sparsity of the NN-EP-ARD solutions: In Kin40k data there are a large 
number truly irrelevant features that should be completely pruned out of the model, whereas with 
the other data sets most features have probably some relevance for predictions or at least they are not 
generated in a completely random manner. Further evidence for this is given by the clearly better 
performance of GP-ARD over GP with the Kin40k data. 

If the mean log predictive densities (MLPDs) are considered, the NN-MC approach based on 
a finite network performs best in all data sets except with Kin40k, where the infinite GP-ARD 
network is slightly better The main reason for this is probably the strong nonlinearity of the true 
latent mapping, which requires a large number of hidden units, and consequently the infinite GP 



4. http : //bees . aalto . f i/en/research/bayes/gpstuf f / 

5. http: //bees . aalto . f i/en/research/bayes/mcmcstuf f / 
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Table 1: A predictive assesment of the proposed EP approach for neural networks with two dif- 
ferent prior definitions: Laplace priors with one common scale parameter (|) (NN-EP-LA) 
and Gaussian ARD priors with separate scale parameters for all inputs (NN- 

EP-ARD). The methods used for comparison include a neural network with ARD priors 
inferred using MCMC (NN-MC), and two GPs with a neural network covariance: one 
with a common variance hyperparameter for all inputs (GP), and another with separate 
variance hyperparameters for all inputs (GP-ARD). The log predictive densities are sum- 
marized with their means, standard deviations (std), and lower 1% percentiles (prct 1%). 
The squared errors are summarized with their means, standard deviations (std), and upper 
99% percentiles (prct 99%). 



Housing 


log predictive density (LPD) 


squared error (SE) 




(?i=506, d=l3) 


mean 


std 


prct 1% 


mean 


std 


prct 99% 


cputime 


NN-EP-LA 


-0.44 


1.64 


-7.55 


0.15 


0.45 


2.42 


1.0 


NN-EP-ARD 


-0.50 


1.66 


-6.31 


0.17 


0.49 


1.60 


1.0 


NN-MC 


-0.08 


1.17 


-4.54 


0.11 


0.50 


1.18 


110.5 


GP 


-0.29 


2.35 


-7.57 


0.13 


0.53 


1.98 


0.3 


GP-ARD 


-0.20 


2.00 


-10.71 


0.10 


0.37 


1.53 


1.0 


Concrete (n=215, J=27) 


NN-EP-LA 


0.18 


0.85 


-3.05 


0.05 


0.08 


0.30 


1.0 


NN-EP-ARD 


0.05 


1.03 


-4.61 


0.05 


0.11 


0.57 


0.8 


NN-MC 


0.22 


1.52 


-3.62 


0.04 


0.08 


0.28 


103.0 


GP 


-0.07 


1.70 


-5.12 


0.06 


0.11 


0.66 


0.03 


GP-ARD 


0.15 


1.98 


-4.23 


0.04 


0.08 


0.28 


0.6 


Crime (n=1993, d=l02) 


NN-EP-LA 


-0.83 


0.89 


-4.64 


0.31 


0.55 


2.60 


1.0 


NN-EP-ARD 


-0.84 


0.89 


-4.81 


0.31 


0.55 


2.75 


0.2 


NN-MC 


-0.80 


0.93 


-4.81 


0.29 


0.53 


2.60 


19.8 


GP 


-0.81 


0.91 


-4.80 


0.30 


0.54 


2.69 


0.2 


GP-ARD 


-0.81 


1.01 


-5.49 


0.30 


0.55 


2.75 


4.4 


Kin40k (n=5000, J=100) 


NN-EP-LA 


-0.59 


0.89 


-4.27 


0.19 


0.29 


1.38 


1.0 


NN-EP-ARD 


0.27 


1.19 


-4.63 


0.03 


0.08 


0.37 


0.9 


NN-MC 


0.49 


1.51 


-5.37 


0.02 


0.07 


0.26 


48.7 


GP 


-1.15 


0.72 


-4.18 


0.58 


0.83 


4.06 


0.5 


GP-ARD 


0.64 


1.11 


-3.90 


0.02 


0.05 


0.24 


32.3 



network with ARD priors gives very accurate predictions. In pair-wise comparisons the differences 
in MLPDs are significant in 95% posterior confidence level only with Housing and Kin40k data sets. 
In terms of mean squared errors (MSEs), GP-ARD is best in all data sets except Crime, but with 
95% confidence level the pair- wise differences are significant only with the Kin40k data. With the 
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Kin40k data, the performance of NN-MC could probably be improved by increasing K or drawing 
more posterior samples, because learning the nonlinear mapping with a large number of unknown 
parameters and potentially multimodal posterior distribution may require a very large number of 
posterior draws. 

When compared with NN-MC and GP-ARD, NN-EP gives slighdy worse MLPD scores with 
all data sets except with Concrete. The pair-wise differences in MLPDs are significant with 95% 
confidence level in all cases except with the Concrete data. In terms of MSE scores, NN-EP is also 
slightly but significantly worse with 95% confidence level in all data sets. By inspecting the std:s 
and 1% percentiles of the LPDs, it can be seen that NN-EP achieves better or comparable worst 
case performance when compared to GP-ARD. In other words, NN-EP seems to make more cau- 
tions predictions by producing less very high or very low LPD values. One possible explanation for 
this behavior is that it might be an inherent property of the chosen approximation. Approximating 
the possibly multimodal tilted distribution p{hi^ic), where one mode is near the cavity distribution 
q-i{hi±) and another at the values of /i;,t giving the best fit for ji, with an unimodal Gaussian 
approximation as described in Appendix C, may lead to reduced fit to individual observations. An- 
other possibiUty is that the EP-iterations have converged into a suboptimal stationary solution or 
the maximum number of iterations has been exceeded. Doing more iterations or using an altema- 
tive non-zero initialization for the input-layer weights might result in better data fit. The second 
possibility is supported by the generally acknowledged benefits from different initializations, for 
example, the unsupervised schemes discussed by Erhan et al. (2010), and our experiments using 
the Kin40k data without the extra random inputs. We found that initializing the location parameters 
pv^k and fA„ j of the prior site approximations (13) and (14) using a gradient-based MAP estimate 
of the weights w and v, and relaxing the prior site approximations after initial iterations using the 
proposed EP framework, can result in better MSE and MLPD scores. However, such alternative 
initialization schemes were left out of these experiments, because our aim was to test how good 
performance could be obtained using only the EP algorithm with the zero initialization described in 
Section 3.3.1. 

The CPU times of Table 1 indicate that with small n the computational cost of NN-EP is larger 

compared to GP-ARD, which requires only one 0{rr') Cholesky decomposition per analytically 
tractable marginal likelihood evaluation. However, as n increases GP-ARD becomes slower, which 
is why several different sparse approximation schemes have been proposed (see, e.g, Rasmussen 
and Williams, 2006). Furthermore, assuming a non-Gaussian observation model, such as the bi- 
nary probit classification model, GP or GP-ARD would require several 0{rr') iterations to form 
Laplace or EP approximations for the marginal likelihood at each hyperparameter configuration. 
With NN-EP, probit or Gaussian mixture models could be used without additional computations. 
The computational cost of NN-EP increases linearly with n and K, but as d increases the posterior 
updates of qiy/k), which scale as 0{K(fi), become more demanding. The results of Table 1 were 
generated using a sequential scheme for updating q{y/k) (see Algorithm 1), which can be seen as 
larger computational costs with respect to NN-MC with the Crime and Kin40k data sets. One option 
with larger d is to use parallel EP updates, but this may require more damping or better initiaUzation 
for the input weight approximations. Another possibility would be to use fully factorized posterior 
approximations in place of q{v/k), or to assign different overlapping subgroups of the input features 
into the different hidden units and to place hierarchical prior scale parameters between the groups. 



28 



Expectation Propagation for Neural Networks with Sparse Priors 



5. Discussion 

In this article, we have described how approximate inference using EP can be carried out with a 
two-layer NN model structure with sparse hierarchical priors on the network weights, resulting in a 
novel method for nonUnear regression problems. 

We have described a computationally efficient EP algorithm that utilizes independent approxi- 
mations for the weights associated with the different hidden units and layers to achieve computa- 
tional complexity scaling similar to an ensemble of K sparse linear models. More generally, our 
approach can be regarded as a non-linear adaptation of the various EP methods proposed for sparse 
linear regression models. This is achieved by constructing a factorized Gaussian approximation 
for the posterior distribution resulting from the nonlinear MLP model structure with a linear input 
layer, and adapting the algorithms proposed for sparse linear models separately on the independent 
Gaussian approximations for each hidden unit. Because of the structure of the approximation, all 
existing methodology presented for faciUtating the computations in sparse Unear models can be ap- 
plied on the hidden unit approximations separately. We have also introduced an EP framework that 
enables definition of flexible hierarchial priors using higher level scale parameters that are shared 
by a group of independent linear models (in our case the hidden units). The proposed EP approach 
enables efficient approximate integration over these scale parameters simultaneously with the co- 
efficients of the linear models. We used this framework for inferring the common scale parameter 
of Laplace priors assigned on the input weights, and to implement Gaussian ARD priors for the 
input-layer. In this article, we have focused on the Gaussian observation model, but the method can 
be readily extended to others as well (e.g., binary probit classification and robust regression with 
Gaussian mixture models). 

Using simple artificial examples we demonstrated several desirable characteristics of our ap- 
proach. The sparsity promoting priors can be used to suppress the confounding predictive influ- 
ences of possibly irrelevant features without the potential risk of overfitting associated with point- 
estimate based ARD priors. More precisely, the approximate integration over the posterior uncer- 
tainty helps to avoid pruning out potentially relevant features in cases with large uncertainty on 
the input relevances. Albeit more challenging to estimate, the finite parametric model enables a 
posteriori inspection of the model structure and feature relevances using the hyperparameter and 
weight approximations. Furthermore, the parametric model structure can also be used to construct 
more restricted models by assigning different input variables into different hidden units, grouping 
the inputs using the hierarchical scale priors, using different nonlinear activation functions for the 
different hidden units, or using fixed interaction terms dependent on certain hidden units as inputs 
for the output-layer. 

In deriving the EP algorithm, we have also described different computational techniques that 
could be useful in other models and approximation methods. These include the EP approximation 
for the hierarchical priors on the scale parameters of the weights that could be useful in combining 
sparse linear models associated with different subjects or measurement instances, the noise estima- 
tion framework that could be used for estimating the likelihood parameters in sparse linear models 
or approximate Gaussian filtering methods, and the proposed approach for approximating the tilted 
distributions of the hidden unit activations that could be useful in forming EP approximations for 
observation models involving sums of nonlinear functions taken from random variables with factor- 
ized Gaussian posterior approximations. 
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Appendix A. Cavity Distributions with the Factorized Approximation 

Assuming the factorized approximation of equation (20) for q{z), and applying the transformation 
h; = x/^w on (23) results in the following cavity distribution for h,-: ^-((h;) = Hit ^{hi,k\m-i^k,V-i,k)- 
The scalar cavity means and variances are given by 

m-i^k = V-i,k{Vr^^mi^k - 'HVi.yt), (31) 

where the mean and variance of hi^k under the current approximation (21) are denoted with mt^k = 
xT/iwt and Vi^k = xj^^w^x/, respectively. Using (22) and (23) the j:th cavity distribution for v can be 
written as ^-,(v) = fA£(v|/Lt_,-, 5]_,) and the cavity mean and covariance are given by 

IJ.-i = a + 'EyaiS^^aJa, (32) 

where s = r\^^ — ajYlyfCti and a = /Xy — rjSvA- Using (31) and (32) the cavity evaluations can be 
implemented efficiently: for the input weights v/t only scalar moments of hi k need to be determined, 
and for the output weights v rank-one matrix updates are required. The cavity computations for the 
noise level term approximations (12) and the weight prior term approximations (13, 14) require only 
manipulation of univariate Gaussian distributions and can be implemented similarly as in (31). 

Appendix B. Tilted Moments for the Output Weights 

To obtain the desired site approximation structure (19) and closed form expressions for the the 

corresponding site parameters (r,-, i>,-, a,, and 0i) satisfying the moment matching condition (25) 
we need to form suitable approximations for the marginal means and covariances of hj^ic and v 
resulting from the tilted distribution (24). We start by assuming the noise level 6 known and extend 
the presented approach for approximate integration over ^-,(6) later. 

We first consider an approximate scheme which has already been utilized in the unscented 
Kalman filtering framework for inferring the weights of a neural network (Wan and van der Merwe, 
2000). Adopting the approach to our setting, first a cavity-predictive joint Gaussian approxima- 
tion is formed for the random vector [uf,y,]^ = [hj^, v^,y,]^, which is distributed according to 
q{hi,\,yi\Q) oc p[yi\fi,Q)'^q_i(hi,\). This is done by approximating the central moments E(y,|9), 
Var(j,|0), Cov(h;,y,|6), and Cov(v,y,l0) using the unscented transform. Approximations for the 
mean and covariance of the tilted distribution (24) can now be determined by conditioning on in 
the joint Gaussian approximation of [uj, to obtain E(ui|3?;,6) and Cov(u,|}?;,6), and plugging 
in the observation j,. In our experiments, this approach was found sufficiently accurate for approx- 
imating the moments of v, which is probably explained by the conditional linear dependence of 
/, on V in the observation model. Thus, we approximate the marginal tilted distribution of v with 
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A(v|e)«i?V:(A,-(6),^/(6))' where 

fii{Q)=^i^i + T.,jyy-\yi-mf;) 

i:Ke) = 5:-,--W,7'^v,^' (33) 

and Vy. = V/. +11^^ exp(9). Here we have defined the required cavity predictive moments in terms 
of fi = v^g(h,) instead of y,- to faciUtate the upcoming approximate integration over ^-,(6), and 
assuming the factorized posterior approximation (20), these central moments can be written as 

mf, = E(/,) = At^,mg, 

Vf, = Nw{fi) = mT.S_;mg, +VT(diag(5]_0 + iU_,-o^_,.) 

= Cov(v, fi) = 5:_,-mg, , (34) 

where o denotes the element- wise matrix product, and the 1) x 1 vectors = E(g(h,)) and 
Vg. = Var(g(h,)) are formed by computing the means and variances from each component of g(h;) 
with respect to ^-/(h,) defined in (31). Note that the last elements of mg^ and Vg^ are one and zero 
corresponding to the output bias term vq. 

With the activation function (2) the mean m^. can be computed analytically as 

^{gihk)) = 2K-'l^ (m.i,k{\+V.i,k)-''') -0.5) , 
and for computing the variance Vg^ the following integral has to be evaluated numerically 

Varfefe)) =2(*:.)-/;"''"exp [-^^-^^i--^^ 

where p = V-!,/t(l + V_ijt)^^ Other activation functions could be incorporated by using only one- 
dimensional numerical quadratures whereas with the full posterior couplings (16) A'-dimensional 
numerical integrations would be required to approximate nif., Vf., and ^Dy,/;. 

Appendix C. Tilted Moments for the Hidden Unit Activations 

The challenge in approximating the mean and variance of Pi{hi^k) is that this marginal density can 
have multiple distinct modes, one related to the cavity distribution ^-,(h,) and another related to 
the hkehhood p(y,|v^g(h,),6), that is, to the values of hi^k that give better fit for the left-out obser- 
vation yi. In our numerical experiments, the previously described simple approach based on a joint 
Gaussian approximation for [h7,v^/;] was found to underestimate the marginal probability mass of 
the latter mode related to y,- especially in cases where the modes were clearly separated from each 
other. This problem was found to be mitigated by decreasing r|, which probably stems from leaving 
a fraction of the old site approximation 4 ; from the previous iteration in the approximation that in 
turn shifts the cavity towards the observation _y,. With some difficult data sets, rj-values as small as 
0.5 were found necessary for obtaining a good data fit but usually this also required more iterations 
for achieving convergence compared to larger values of rj. 

To form a robust approximation for the marginal tilted distributions of the hidden unit acti- 
vations hi^k also in case of multimodalities, we propose an alternative approximate method that 
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enables numerical integration over ^,(/i; <;|6) using one-dimensional quadratures. We aim to ex- 
plore numerically the effect of each hidden activation hju on the tilted distributions /5,(h;,v|6) 
fA^(y,|v^g(h;),exp(6))^^_;(h,)^_,(v) when all other activations \\.i-k and the weights v are aver- 
aged out. To approximate the marginalization over _i and v we approximate i), that is, 
the cavity distribution of fi resulting from ^_,(h,)^_,(v) by conditioning on hi^k, with a univariate 
Gaussian given by 

q-i{fi\hk) ^mfi\m{hk),V{hi,u)) , (35) 

where m{hi^k) and denote the mean and variance of /, computed with respect to <7_;(h, v) = 

q-i{hi-k)q-i{y). The required conditional moments m{hi^k) and V{hi^k) can be calculated using 
equation (34) by modifying the A::th element of mg^ and Vg; corresponding to the known values of 
hi^k< that is, [mg.]k = g{hi^k) and [Vg.]yt = 0. Note that the possible numerical integrations for deter- 
mining E(g(h,)) and Var(g(h;)) need to be computed only once for each site update and only the 
terms dependent on mg,. ^ have to be re-evaluated for each value of hi^k- The approximation (35) 
can be justified using the central limit theorem according to which the distribution of the sum in 
fi = Y,k'^i!fS{f^i,k') +vo given hj^t approaches a normal distribution as K increases. 

Using equation (35), we can write the following approximation for the marginal tilted distribu- 
tion of hi^k- 

Pi{hk\Q)°^ / iA^(y,-|vlj.g(h;_fc)-Fvfc/i,;;t,exp(e))'l^_,-(v) P[^_/(;i,-,;t')^^vi^h/_fc 

= / f?v:(y,i/;-,exp(e))\_,(/;i^,-,;t))?-K^,-,-t)^/;- 

f=iZ{e)i^C{yi\m{hi,k),V{hi,k) + ^-^exp{e))q_i{hi,k) 

« Zi,k{e)^{hi,k\mi,k{Q),Vi,k{e)) , (36) 

where all output weights excluding Vk are denoted by \-k and Z, yt(6) is a normalizing constant. In 
the last step we have substituted approximation (35) and carried out the integration over analyti- 
cally to giveZ(9) = (27iexp(9))(^^^^/^r|^'/^. Approximation (36) enables numerical inspection for 
the possible multimodality of 1 9), and the conditional tilted mean m;,t(9) and variance V;,yt(9) 
can be determined using a numerical quadrature. Compared with the simple approach described in 
Appendix B, equation (36) results in more accurate tilted mean estimates in multimodal cases. 

Appendix D. Tilted Moments with Unknown Noise Level 

If the noise level 9 is assumed unknown and estimated using the EP, the marginal mean jiiej and 
variance dg ^ can be approximated with a similar approach as was done for hij^ in Appendix C. We 
approximate first the cavity distribution of /,• with q-i{fi\Q) ~ ^^C{yi\mf.,Vf|), where the mean and 
variance are computed using (34). Then, assuming a Gaussian observation model, we can integrate 
analytically over fi to obtain a numerical approximation for the tilted distribution of 9: 

A-(e)- / i7V:(>';|vTg(hO,exp(9))\_,-(v)^_KhO^-/(eyvJhi 

^ Z{Q)9i {yi\mf„Vf, +Ti-iexp(9)) q.i{e) ^ Z,-f?v:(9|Pe,,-,ai,.), (37) 
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where Z(6) = (27iexp(9))(^ 1/2^ is an approximation for the normaUzation term of 

the tilted distribution (24). Using equation (37) the approximate mean //e variance dg j, and nor- 
mahzation term Z, can be calculated with a numerical quadrature, and if is known or fixed, the 
normalization term can by approximated with Z;(0) = Z(0)iV^ (jii"^/,) V/i +11^' exp(0)). 

To approximate the marginal means and covariances of v and hi^k with unknown the condi- 
tional approximations of equations (33) and (36) have to be integrated over qi (0) = Z[' ^Zi (0)^-/ (0) ~ 
^,(0) because we have ^,(hi,v,0) Zr^^,(h;, v|0)Z;(0)^_,(0) according to (37). Incase of the sim- 
ple joint Gaussian approximation for v we can write 

A« =E^,-(v)W =Ep,-(e) (Ep,(v|6)(v|0)) ?aE^,(e)(/ii(0)) 

= iU_,- + Sv,/; E^,(e) (^3-7 ' ) (3'; " '"z; ) > (38) 

where the conditional mean of v with respect to ^;(v|0) is approximated using (33), and the integra- 
tion over Vy7^ = [V/. +ri^^ exp(0))^^ can be done using a one-dimensional quadrature. Similarly, 
for the marginal covariance of v we can write 

ti = CoVp.(^)(v) = E^_.(e) (Covp,(^|e)(v|0)) +Cov^,(e) (E^,(^|e)(v|0)) 

« E^,.(e) {ti{Q)) +E^,(e) ((A/(e) - A,) (A.-(e) - fiif) 

= S-/- Sv,^. (E^,.(e) {V-') - (y,- -m^.)' Var^,(e) (V"^)) ^lf„ (39) 

where the conditional covariance of v with respect to Pi{\\Q) is approximated using (33) and 
Var^.(e) (^,,7') = E^^(e) (V^T^ — E^.(e)(K70)^ be computed with a numerical quadrature. 

For the output weights v the integration over the uncertainty of can be done without significant 
additional computational cost. The mean and variance of can be determined using the same 
function evaluations that are used in the quadrature integrations required for computing Ae,is f 
Z, according to equation (37). Approximating the marginal means and covariances of the hidden 
unit activations hi k is more demanding because integration over the approximate marginal tilted 
distribution resulting from approximation (36), 

would require a two-dimensional numerical quadratures for each hidden unit K. To reduce the 
computational burden, we approximate the probability mass of pi{hi^k,^) to be relatively sharply 
peaked near the marginal expected value Ae,/ resulting from (37) yielding 

pi{hi,k) « Zr'z{Q)^M (3;,|m(/j,;0, V(/i;,i) +ri-iexp(Ae,0) q-iihk) 

~ ^{hi,k\ma{MA.,kif^dj)) ■ (41) 

This approximation does not require additional computational effort compared to the conditional 
estimate (36) and the difference in accuracy compared to the two-dimensional quadrature estimate 
based on (40) is small after a few iterations provided that there are enough observations. 

Appendix E. Site Parameters and Updates 

In this appendix we present closed form expressions for the parameters of the site approximations 
(19) resulting from the moment matching condition (25) and the approximate tilted moments derived 
in Appendices B - D. 
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Using the moment matching condition S^"^ = Si - + r\ai&J resulting from (25) and approxi- 
mations (33) or (39) we can write the following expression for the scale parameter vector a,- of the 
/:th approximate site term ty^i related to the output weights: 

ai = mg,sign(a,-) \di\'/^ (l - diml^.im^)~'^\-'/\ (42) 

where a; = E^.(e) (^^,7^) — (y,- — m/J^ Var^.(e) (VyT^) with unknown 6 and di = Vy7^ otherwise. Sim- 
ilarly for the location parameter vector /3„ equation (25) results in the moment matching condition 
E^^/i,- = El- ^t_, +T1/9, that together with equation (33) or (38) gives 

/3, = mg,. (l-a/m^.S_/mg,.) ^ (a/m^.iLt-,-|-^,(y, -m/;.))ri"^ (43) 

where a,- is defined as in the previous equation, and hi = E^.(e) {V~^) when 6 is unknown and 
otherwise hi = V^^^. By looking at equations (42) and (43) we can now extend the discussion of the 
last paragraph of Section 3.1. The mean and covariance of the posterior approximation ^(v) defined 
in equation (22) can be interpreted as the posterior distribution of a linear model where the input 
features are replaced with the expected values of the nonlinearly transformed input layer activations 
lUg. = Eq_. (g(xTw)) and pseudo observations y,- = mg./i_,- + — my;.) are made according to 

an observation model ?V^(y, |mg.v,a,^^ — S_,mgJ. 

Damping the site updates can improve the numerical robustness and convergence of the EP al- 
gorithm, but applying damping on the site precision structure T,,vv = otiOtJ resulting from equations 
(22) and (42), that is, ff^ = (1 - 5)Q:f'*(af '*)T + daiaj, would break the outer product form of 
the /:th likelihood site approximation (19) and produce a computationally more demanding rank-A' 
site precision after K iterations. In case the input weight approximations ^(w^:) were kept fixed 
while updating the output weights v, the expected activations m(g, ) would remain constant and one 
could consider damping only the scalar terms on the right hand side of equations (42) and (43). 

In the more general case where also the site parameters X;i and v,jt related to the input weights 
are updated simultaneously, we can approximate the new site precision structure T"^* = A;AT, 
where A, = [( 1 — 8) V2Q,oid ^ 51/2Q, .] ^j^j 5, . obtained from (42), with its largest eigenvector at each 
site update step. This requires solving the eigenvector v,- corresponding to the largest eigenvalue 
Xi of the 2 X 2 matrix AjAi Ri V/X.,vj^ after which the new damped site parameter vector can be 
approximated as 

a^«^ = A,v,-. (44) 

Damping the site location vector 0i is straightforward because update /S?*^* = (1 — ?>)0f'^ + 8/3,- = b,-, 
where 0i is obtained from (43), will preserve the structure of the site approximation (19). However, 
approximation a"*^ = A,v,- changes the moment consistency conditions used in deriving (43) which 
is why 0f™ has to be modified so that combining it with af™ according to the moment matching 
rule (25) results in the same mean vector as the rank-2 site A,A7 combined with b,. In other 
words, we approximate the posterior covariance Sy resulting from the rank-two damped update 
but choose 0f^^ so that the mean fiy will be exact. This is achieved by updating the site location 
according to 

Pf^'- = hi + ^-%{yiyJ-l){AJ'E_iAi+^-'l)-'AJ{^J,_i + ^l:_ihi) (45) 
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where b,- = (1 - d)$f'^ + 5/3;. 

With the factorized posterior approximation (20) the parameters of the likelihood site approx- 
imation terms associated with the input weights decouple over the different hidden units k = 
l,...,K and consequently the moment matching condition (25) results in simple scalar site parameter 
updates. Using the moment matching condition with the cavity definitions (31) and the approxima- 
tion (36) or (41) gives the following site updates 

= (1 -5)v + 5Ti-Ht/r,' - V-;,) =x,, + 5Ti-Ht>a' -^^a') (46) 
vj;r = {l-5)Vi,k + d^-\V,-^'mi,k-V:l^m.i,k)=Vi,k + d^-\V.^^^ 

where 5 G (0, 1] is a damping factor and the marginal tilted moments resulting from either fixed or 
unknown 9 are denoted simply with m,- and Vf^ic- Equation (46) shows that the EP iterations on the 
input weights w^: have converged when the approximate marginal means nii^k and variances Vi^k of 
the activations hi^k from all hidden units are consistent with all tilted distributions (24). 

Appendix F. Computing the Predictions 

The prediction for a new test input x* can be computed using approximations (17), (21) and (22), as 
follows 

p{y*\^*) ~ / p(3'*|/(x*),6)^(v|/Liy,Sv)n^(^*=l^«'^'^"'J^(9|/'e,cfe)'iv(iw(ie 

= f f7V:(j*|m/.,V/,+exp(e))^(0)je, (47) 

where the approximate mean m/, and Vf^ of the latent function value /* = Vytg(wjx*) + vq is 
approximated in the same way as in equation (34). The cavity mean and covariance are 
replaced with jx^ and 'Siy, and the activation means mg^ = U{g{h.^)) and variances Vg, = Var(g(h*)) 
are computed with respect to the approximations q{y/k)- The predictive mean is given by E(j* |x*) = 
E(E(j*|x*,0)) = nif^. The predictive variances Var(j*|x*) = E(Var(y*|x*,6)) + Var(E(y*|x*,6)) = 
Vf^ +E(exp(6)) and the predictive densities p{y^\x^), can be approximated either with a plugin 
value for = ^ue or by integration over 6 using a numerical quadrature (in the experiments we used 
numerical quadratures). 

Appendix G. Marginal Likelihood Approximation 

An EP approximation for the log marginal likelihood log /?(y|X) can be computed in a numeri- 
cally stable and efficient manner following the general EP formulation for Gaussian approximating 
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families summarized by Cseke and Heskes (201 1, appendix C) 

k=i 1=1 



1 ^ 



'1 (=1 ^ ^ 
+ ^ L ( - ^ (InM) + 'i'i^J^if - iiA^(Mv + a,-) 

+ mmi,k,Vi,k)-^im.i,k,V-i,k)] + — £ InZ,. 

I Kd 

-'PCavvcO -»I'(w,o,a^,o) - t »I'H,o,aJ,o), (48) 



where si = r]"^ + (a^Xlw^S:/, a, = iiy — riEv/3,, and the normaUzation term of the tilted distribution 
~ / p(3';k^S(h;))Q)^^-i(^)h!5Q)'^^'^h!'^Q is computed using (37). The normaUzation terms of 
the other tilted distributions are defined as 
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J P{n\olo)'^"q-k{vk)dvk and Z^.= j p{wj\(Sfi^)^-q-j{wj)q-j{(Sfi^)dwjd(^i^, 



and they can be computed using numerical quadratures. The normalization terms (also known as 
log partition functions) related to the Gaussian cavity and marginal distributions can be computed 

as 

^(m, = ^/^ V + ^ log ISI + ^ log(2ji), 

where /x and i>' = S^'/iare(ixl vectors and S is a li x (i matrix. In the fifth line of (48) the means 
and variances of the cavity distribution related to the prior term p{wj\<^ij) are denoted according 
to (see equation (26)) 

and the corresponding approximate marginal distributions are defined as 

Similar notation is used for the hkehhood term approximations of 6 in the second Une and the 
prior term approximations of {vk}f^i in the sixth line. The last line of (48) contains the constant 
normalization terms related to the fixed Gaussian priors of the output bias vq, the noise level 9 = 
loga^, and the input weight scales 
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All terms of equation (48) excluding *F(/Xv,Xlv) and ^w^) can be computed without 

significant additional cost simultaneously during the EP update of the corresponding site approx- 
imation. Term ^'{fiy^'Sy) can be computed using one Cholesky decomposition at each parallel 
update step of qiy/k) in line 7 of Algorithm 1. Similarly, if parallel updates are used for the input 
weight approximations, 5]^^) can be computed using the same Cholesky decompositions 

that are used to recompute q{y/k) in line 6 of Algorithm 1. In case sequential EP is used for q{y/k) 
in line 5 of Algorithm 1, vectors 1/^*, = ^w/z^w^ and determinant term log|I]v| can be updated 
simultaneously with the rank-1 updates of /liw^ and E^^. 

The EP approximation log Zgp has the appeaUng property that its partial derivatives with respect 
to the site parameters in their canonical forms (for example, X;jt, Vjjt, T,- yv = otjaj, /3,-, Te,,- = dg ^ 
and Ve,,- = Oq ■ fl^j) are zero when the algorithm has been iterated until convergence (Opper and 
Winther, 2005). This follows form the fact that the fixed points of the EP algorithm correspond 
to the stationary points of (48) with respect to the site parameters (or equivalently the cavity pa- 
rameters defined using constraints of the form x,- ,t = V[l^ — VZi\)- Thereby the marginal hkehhood 
approximation can be used in gradient-based estimation of hyperparameters such as 9, a^Q, q, or 
{<l^/}f=i in case they are not inferred within the EP framework for determining {^(W;t)}f^j and ^(v). 
Because the convergence of the hkehhood approximation can take many iterations it is advisable 
to initialize the hyperparameters to sensible values and run the EP algorithm once until sufficient 
convergence starting from zero initialization for the site parameters. After that gradient-based local 
update steps can be taken for the hyperparameter values by continuing the EP iterations from the 
previous site parameter values at each new hyperparameter configuration. 
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